1. Introduction


The “dictionary lookup” stage in a sophisticated natural-language system can involve
much more than simple information retrieval. In text, the words that the system knows may
show up in heavily disguised form. Inflectional endings such as tense and plural markings may
be present; the addition of prefixes and suffixes may change part-of-speech and meaning in
systematic ways; in many languages words may have unrelated clitics attached. The addition
of prefixes, suffixes, and endings is often accompanied by spelling changes as well; in English,
try+s becomes tries and dig+er becomes digger. The rules of spelling change can be rather
complex.

Superficially, it seems that word recognition might potentially be complicated and
difficult. This paper examines the question more formally by investigating the computational
characteristics of the “two-level” model of morphological processes (§2). Given the kinds of
constraints that can be encoded in the model, how difficult can it be to translate between
lexical and surface forms? Although the use of finite-state machinery in the two-level model
gives it the appearance of computational efficiency, the model itself does not guarantee
efficient processing. Taking the KIMMO system (Karttunen, 1983) for concreteness, sections 4
and 6 will show that the general problem of mapping between lexical and surface forms in
two-level systems is computationally difficult in the worst case. If null characters are excluded,
the problem is NP-complete. If null characters are completely unrestricted, the problem is
PSPACE-complete and thus probably even harder in the worst case. The fundamental
difficulty of the problems does not seem to be a precompilation effect (§5).

1.1. Morphological analysis

The word-level processing carried out by a natural-language system is formally a type of
morphological analysis, concerned with recovering the internal structures of input words. For
example, singing can be recognized as an inflected form of the verb sing, while unhappy
can be analyzed as un+happy. However, the morphological component cannot break words up
blindly; despite appearances, duckling is not the -ing form of a verb. The morphological
analyzer must know the basic words of the language in addition to the prefixes and suffixes. In
fact, analysis must be guided by more specific constraints as well. Not every word can combine
with every affix; it would be an error to analyze unit as un+it or beer as be+er (compare
doer).

The number of inflected forms of a given word is smaller in English than in many other
languages. As a result, for a system with small scope it often suffices to trivialize morphological
analysis by listing all inflected forms in the dictionary directly. The trivial approach is not
feasible for heavily inflected languages such as Finnish, in which a word can have thousands
of possible forms. In such cases, both practicality and elegance require a more systematic
treatment in terms of inflectional endings, mood and tense markers, clitics, and so forth.

The problem of recovering the internal structures of words can take an extreme form
in languages that allow productive compounding. Kay and Kaplan (1982) illustrate such a
situation with the German word Lebensversicherungsgesellschaftsangestellter, which
means life insurance company employee. An exhaustive dictionary is impractical when such
free compounding is possible.

1.2. Spelling changes

Besides knowing the stems, affixes, and co-occurrence restrictions of a language, a
successful morphological analyzer must take into account the spelling changes that often accompany
the addition of suffixes and similar elements.¹ The program must expect love+ing to appear
as loving, fly+s as flies, lie+ing as lying, and big+er as bigger. Its knowledge must be
sufficiently sophisticated to distinguish such surface forms as hopped (= hop+ed) and hoped
(= hope+ed). Cross-linguistically, spelling-change processes may span either a limited or a
more extended range of characters (§1.2.1), and the material that triggers a change may occur
either before or after the character that is affected (§1.2.2). Complex copying processes (§1.2.4)
may be found in addition to simpler, more specific changes.

1.2.1.  Local  and  long-distance  processes 

The spelling changes associated with the addition of English suffixes are local in the sense
that they do not affect letters far away from the word-suffix boundary. However, there are
processes in other languages that operate over longer distances. The spelling of Turkish suffixes
is systematically affected by vowel harmony processes, which require the vowels in a word to
agree in certain respects.² The vowels that appear in a typical suffix are not completely
determined by the suffix, but are determined in part by the rules of vowel harmony. The suffix
that Underhill (1976) writes as -sInIz may appear in an actual word as -siniz, -sunuz,
-sünüz, or -sınız depending on the preceding vowel. Turkish words may contain large numbers
of suffixes, and the effects of vowel harmony can propagate for long distances. (Hungarian
suffixes display similar changes.)
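As a rough illustration of how harmony propagates through a suffix, the following Python sketch fills in the vowels of a suffix from the most recent vowel to its left. The function name `harmonize`, the use of capital `I` as a placeholder for a harmonizing high vowel, and the sample stems are assumptions made for this sketch; it models only the four-way high-vowel alternation mentioned above and is not a description of Turkish morphology as a whole.

```python
# Hypothetical sketch: four-way high-vowel harmony for a suffix such as -sInIz.
# Each placeholder "I" takes the high vowel that agrees with the nearest
# preceding vowel, so the effect propagates through the suffix left to right.
HIGH_VOWEL_AFTER = {
    "a": "ı", "ı": "ı",
    "e": "i", "i": "i",
    "o": "u", "u": "u",
    "ö": "ü", "ü": "ü",
}

def harmonize(stem: str, suffix: str) -> str:
    """Attach `suffix` to `stem`, resolving each placeholder "I"."""
    # Find the last vowel of the stem; it conditions the first suffix vowel.
    last = next(ch for ch in reversed(stem) if ch in HIGH_VOWEL_AFTER)
    out = list(stem)
    for ch in suffix:
        if ch == "I":
            ch = HIGH_VOWEL_AFTER[last]
        if ch in HIGH_VOWEL_AFTER:
            last = ch  # this vowel now conditions whatever follows
        out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    for stem in ("hasta", "güzel", "yorgun", "üzgün"):
        print(stem, "->", harmonize(stem, "sInIz"))
```

Because each resolved vowel becomes the trigger for the next one, a long chain of suffixes would be handled by the same loop, which is the sense in which the effect can travel over long distances.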

1.2.2.  Left  and  right  context 

Local spelling changes often depend on right context as well as left context; for instance,
carry+ed changes y to i but carry+ing retains y. Less commonly, long-distance changes can
also be triggered by material to the right.³ Verb stems in the Australian language Warlpiri
display a regressive change of i to u triggered by a tense suffix containing a nasal u; thus the
imperative form of throw is kiji-ka, but the past-tense form is kuju-rnu (Nash, 1980:84).
As illustrated, this harmony process can affect more than one i in the verb stem. It can also
propagate through the element -rni that can appear between the verb stem and the tense
ending. (Warlpiri also has another long-distance harmony process, which operates from right
to left.)

¹ Spelling-change processes actually represent a superficial amalgam of phonological changes and
orthographic conventions. In this paper, these two aspects of spelling changes will not be distinguished. The
phonology and the orthography of a language do not have the same status for linguistics, but the differences
are not relevant for present purposes. Note also that it is the surface spelling of a word that will be presented
to a program that analyzes written text.

² For details of this process, see Underhill (1976), Clements and Sezer (1982), and numerous references cited
therein.

³ Many current analyses of vowel harmony take it to be a fundamentally nondirectional process, even in
languages in which it always appears to operate from left to right. For example, it appears as though the
influence of root vowels on affix vowels always proceeds from left to right in Turkish, but this is because
Turkish lacks prefixes. Clements and Sezer (1982:2[…]ff) discuss a process of colloquial Turkish in which a
vowel is inserted between the initial letters of certain words. The choice of vowel is determined by the usual
harmony rules of Turkish, but operating from right to left in this case. See also Poser (1982).

Other languages provide further examples of long-distance changes that are conditioned
by material to the right. Kay and Kaplan (1982) mention a vowel-change process in Icelandic
that causes vowels in the middle of a word to depend on the vowels in a following suffix. The
inflectional system of German also involves vowel changes. Poser (1982:131ff) discusses an
extreme example of long-distance right-to-left harmony that occurs in the language Chumash.
The process that he describes changes s to S throughout the entire word when an S occurs in
a suffix; thus s+lu+sisin+waS (3+all+grow awry+past) becomes SluSiSinwaS (it is all grown
awry).

1.2.3.  Right  context  and  processing  ambiguity 

The existence of changes that depend on right context implies that the lexical-surface
correspondence for a particular character cannot always be determined when the character is
first seen in a left-to-right scan. However, right context is not crucial for the occurrence of
this difficulty. The same kind of local ambiguity can arise even when spelling changes do not
depend on right context.

Suppose we were to remove the dependence of the y-to-i change on right context by
considering a rule system in which y always changes to i after p.⁴ There could still be uncertainty
about how analysis should proceed. A surface string beginning spi... could correspond to a
lexical string spy... as in spies, but it could equally well correspond to spi... as in spider
or spiel. In general, analysis may proceed several characters beyond a choice point before it
becomes apparent which choice is correct. This is especially true with a large system
vocabulary: in the above example, a system that did not know any spi... words could immediately
rule out spi... in favor of spy..., but a system with more complete coverage would have to
look further into the input before it could identify the correct choice.
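The effect of vocabulary size on this local ambiguity can be made concrete with a small depth-first search. The Python sketch below is only an illustration under stated assumptions: the function `analyze`, the toy rule (a surface i after p may stand for lexical y or lexical i), and the three-word lexicon are all hypothetical, and no suffixes or other rules are modelled.

```python
# Hypothetical sketch of the "spi..." choice point: after p, a surface i may
# correspond to lexical y or lexical i, and only the dictionary can decide.
def analyze(surface, lexicon):
    """Return (lexical analyses found, lexical prefixes tried in order)."""
    results, visited = [], []

    def search(pos, lexical):
        visited.append(lexical)
        # Prune as soon as no dictionary entry begins with this prefix.
        if not any(word.startswith(lexical) for word in lexicon):
            return
        if pos == len(surface):
            if lexical in lexicon:
                results.append(lexical)
            return
        ch = surface[pos]
        options = [ch]
        if ch == "i" and lexical.endswith("p"):
            options = ["y", "i"]  # the choice point; y is tried first
        for option in options:
            search(pos + 1, lexical + option)

    search(0, "")
    return results, visited

if __name__ == "__main__":
    print(analyze("spiel", {"spy", "spider", "spiel"}))
    print(analyze("spi", {"spy"}))
```

With the larger lexicon the search carries the spy hypothesis one character past the choice point before abandoning it; with a lexicon that contains no spi... words, the spi hypothesis is rejected immediately.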

1.2.4.  Reduplication 

Some languages display a kind of change called reduplication that often does not lend
itself to analysis by the kinds of mechanisms that are appropriate for the other processes that
have been mentioned here. Reduplication processes involve the copying of consonants, vowels,
syllables, roots, or other subunits of words. Nash (1980:136ff) describes a reduplication process
in Warlpiri that copies the first two syllables of a verb and has various semantic effects. For
example, he cites the sentence

    pirli ka parnta-parnta-rri-nja-mpa ya-ni
    hill PRES crouch-REDUP-INF-across go-NONPAST
    The mountain extends in a series of humps.

⁴ If y always changes to i after p, what justification could there be for saying that spy and not spi is the
correct underlying form? In this trivial constructed example, there is none. In an actual language, there could
be evidence from a variety of sources: suffixes beginning with y; harmony processes; rules that create or destroy
the p that triggers the change; rules that are triggered by the y before it changes; and so forth.




in which the verb stem parntarri- has undergone reduplication.⁵ Lieber’s (1980:234ff)
discussion of several reduplication processes in the language Tagalog provides other examples.
One Tagalog reduplication process copies the first consonant and vowel of the stem, making
the copied vowel short; another is similar, but makes the copied vowel long; a third process
copies the first syllable and part or all of the second, lengthening the copied vowel of the second
syllable. See also McCarthy’s (1982:193f) treatment of reduplication in Classical Arabic.⁶


⁵ The hyphens in the Warlpiri examples are inserted as an analytical aid for the reader, and do not conform
to the standard orthography (Hale, 1982:232).

⁶ McCarthy’s treatment of Arabic is of theoretical interest for at least two reasons: it helps illuminate the
nature of linguistic representations, and it shows a way to derive many characteristics of Arabic reduplication
from universal linguistic principles rather than language-particular stipulations.

2. Two-Level Morphology


Given a description of the root forms, the combinatory patterns, and the spelling-change
rules of a language, the morphological analysis task is well-defined in an abstract sense.
However, a practical morphological analyzer also needs an efficient way of putting its linguistic
knowledge to use in actual processing. The KIMMO system described by Karttunen (1983)
is attractive for this purpose. KIMMO is an implementation of the “two-level” model of
morphological analysis that Kimmo Koskenniemi proposed and developed in his Ph.D. thesis.⁷
Spelling-change rules are encoded in a finite-state automaton component, while roots and
affixes are listed with their co-occurrence restrictions in a dictionary component. The focus
here is on the automaton component. (Reduplication processes find no easy treatment in the
KIMMO system, and will henceforth be ignored.)

2.1.  The  Automaton  Component 

The two-level model is concerned with the representation of a word at two distinct levels,
the lexical or dictionary level and the surface level. At the surface level, words are represented
as they might show up in text. At the lexical level, words consist of sequences of stems,
affixes, diacritics, and boundary markers that have been pasted together without spelling
changes. Thus Karttunen and Wittenburg (1983) represent the surface form tries as try+s
at the lexical level. Similarly, the Warlpiri surface form kijika might be represented at the
lexical level as kIjI-ka, where I is a special lexical character that can surface as either i or
u according to harmony rules.

2.1.1.  Expressing  Spelling  Changes  as  Two-Level  Automata 

A spelling-change rule in the two-level model is expressed as a constraint on the
correspondence between lexical and surface strings. For example, consider a simplified “Y-Change”
process that changes y to i before adding es. Y-Change can be expressed in the two-level
model as a constraint on the appearance of the lexical-surface pairs y/y and y/i. Lexical y
must correspond to surface i rather than surface y when it occurs before lexical +s, which will
itself come out as surface es due to the operation of other constraints.

⁷ University of Helsinki, Finland, circa Fall 1983.

Each constraint is encoded as a finite-state machine with two scanning heads that move
along the lexical and surface strings in parallel. The machine starts out in state 1, and at each
step of its operation, it changes state based on its current state and the pair of characters it
is scanning. The automaton that encodes the Y-Change constraint would be described by the
following state table:

"Y-Change" 5 5
             y  y  +  s  =    (lexical characters)
             i  y  =  s  =    (surface characters)
state 1:     2  4  1  1  1    (normal state)
state 2.     0  0  3  0  0    (require +s)
state 3.     0  0  0  1  0    (require s)
state 4:     2  4  5  1  1    (forbid +s)
state 5:     2  4  1  0  1    (forbid s)

In this notation, taken from Karttunen (1983) following Koskenniemi, = is a certain kind of
wildcard character. The use of : rather than . after the state-number on some lines indicates
that the : states are final states, which will accept end-of-input. In order to handle insertion or
deletion, it is also possible to have a null character 0 on one side of a pair,⁸ but the possibility
of nulls will not be given full consideration until section 6.

In processing the lexical-surface string pair try+s/tries, the automaton would run
through the state sequence 1,1,1,2,3,1 and accept the correspondence. In contrast, with the
string pair try+s/tryes it would block on s/s after the state sequence 1,1,1,4,5 because the
entry for s/s in state 5 is zero. With the pair try/tri it would not block with any zero
entries, but would still reject the pair because it would end up in state 2, which is designated
as non-final.

These examples illustrate how the Y-Change automaton implements dependence on the
right context +s. The automaton will accept either of the correspondences y/i and y/y, but
if it processes the y/i correspondence, it will enter a sequence of states that will ultimately
block unless the y/i pair is followed by the appropriate lexical context +s. The right context
for a vowel harmony process might seem more difficult to encode because it may be necessary
to ignore several intervening consonants, but such a situation actually presents no problem at
all. An automaton state can easily ignore irrelevant characters by looping back to itself.
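The behavior of the Y-Change automaton on the three string pairs discussed above can be checked mechanically. The following Python sketch simulates a five-state Y-Change table; the names `COLUMNS`, `TABLE`, `column_for`, and `run` are assumptions of the sketch, as is the tie-breaking rule that the column with the fewest wildcards wins when several columns match a pair. Null characters are not handled.

```python
# Sketch of a two-level automaton run, assuming the five-state Y-Change table.
# Each column is a (lexical, surface) pair; "=" is a wildcard.
COLUMNS = [("y", "i"), ("y", "y"), ("+", "="), ("s", "s"), ("=", "=")]

TABLE = {
    1: [2, 4, 1, 1, 1],  # normal state
    2: [0, 0, 3, 0, 0],  # require +s
    3: [0, 0, 0, 1, 0],  # require s
    4: [2, 4, 5, 1, 1],  # forbid +s
    5: [2, 4, 1, 0, 1],  # forbid s
}
FINAL = {1, 4, 5}  # states written with ":" in the table

def column_for(lex, surf):
    """Index of the most specific column matching the pair (assumed rule)."""
    best, best_score = None, -1
    for idx, (cl, cs) in enumerate(COLUMNS):
        if cl not in ("=", lex) or cs not in ("=", surf):
            continue
        score = (cl != "=") + (cs != "=")  # count non-wildcard positions
        if score > best_score:
            best, best_score = idx, score
    return best  # the =/= column always matches, so this is never None

def run(lexical, surface):
    """Return (accepted, sequence of states visited before any block)."""
    if len(lexical) != len(surface):
        raise ValueError("this sketch does not handle null characters")
    state, trace = 1, [1]
    for lex, surf in zip(lexical, surface):
        state = TABLE[state][column_for(lex, surf)]
        if state == 0:
            return False, trace  # blocked on a zero entry
        trace.append(state)
    return state in FINAL, trace

if __name__ == "__main__":
    for pair in [("try+s", "tries"), ("try+s", "tryes"), ("try", "tri")]:
        print(pair, run(*pair))
```

The wildcard column =/= plays the role of the self-loop mentioned above: any pair that no other column claims leaves states 1, 4, and 5 free to continue.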

2.1.2.  Multiple  Spelling-Change  Processes 

A language will generally exhibit several different spelling-change processes; for example,
Karttunen (1983:177) mentions that Koskenniemi’s analysis of Finnish uses 21 rules. By and
large, these separate processes can be encoded as separate automata in the KIMMO system.
In actual processing, the automata that express various spelling-change constraints will all
inspect the lexical-surface correspondence in parallel. The correspondence will be accepted
only if every automaton accepts it, that is, if it satisfies every constraint.⁹ Because the
automata are connected in parallel rather than in series, there are no “feeding” relationships
between two-level automata.¹⁰ Figure 1 illustrates the parallel arrangement of the KIMMO

⁸ The actual KIMMO system of Karttunen (1983) does not allow null characters at the lexical level, but the
omission is inessential (Karttunen, p.c.).

⁹ If null characters are allowed, the interpretation of “every constraint” takes on a certain subtlety.
See section 6.

¹⁰ It is a theoretical claim of the two-level framework that intermediate levels of representation and “feeding”
relationships are not necessary; that two levels suffice, in other words. Series connection of the automata


    ... t r y + s ...    (lexical string)
    ... t r i e s ...    (surface string)

Figure 1: The automaton component of the KIMMO system consists of several two-headed finite-state
automata that inspect the lexical-surface correspondence in parallel. Each automaton imposes some
constraint on the correspondence. The automata move together from left to right. (From
Karttunen, 1983:176.)


automata. A set of several automata can also be compiled into a single large automaton that
will run faster than the original set, though its size may be prohibitive (:176f).
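The parallel arrangement can be sketched in a few lines of Python. Everything named here is hypothetical: the `Automaton` class, the two toy constraints (`feasible` and `y_to_i_needs_plus`), and the function `accepted_by_all` are illustrative stand-ins, not the automata of any actual KIMMO rule set. The point is only that each machine keeps its own state, all machines read the same lexical-surface pair at each step, and a correspondence survives only if no machine blocks and every machine ends in a final state.

```python
from typing import Callable, List, Optional, Set, Tuple

Pair = Tuple[str, str]  # (lexical character, surface character)

class Automaton:
    """A constraint given as a step function; None means the machine blocks."""
    def __init__(self, name: str, step: Callable[[int, Pair], Optional[int]],
                 finals: Set[int], start: int = 1) -> None:
        self.name, self.step, self.finals, self.start = name, step, finals, start

def feasible(state: int, pair: Pair) -> Optional[int]:
    # Toy constraint 1: only identity pairs, y/i, and +/e may appear at all.
    lex, surf = pair
    return 1 if lex == surf or pair in {("y", "i"), ("+", "e")} else None

def y_to_i_needs_plus(state: int, pair: Pair) -> Optional[int]:
    # Toy constraint 2: a y/i pair must be followed immediately by +/e.
    if state == 2:
        return 1 if pair == ("+", "e") else None
    return 2 if pair == ("y", "i") else 1

AUTOMATA: List[Automaton] = [
    Automaton("feasible-pairs", feasible, {1}),
    Automaton("y-to-i-needs-plus", y_to_i_needs_plus, {1}),
]

def accepted_by_all(automata: List[Automaton], lexical: str, surface: str) -> bool:
    """Step every automaton over the same pairs; all must accept."""
    if len(lexical) != len(surface):
        return False  # null characters are outside this sketch
    states = [a.start for a in automata]
    for pair in zip(lexical, surface):
        for i, a in enumerate(automata):
            nxt = a.step(states[i], pair)
            if nxt is None:
                return False  # one blocked constraint rejects the pair
            states[i] = nxt
    return all(s in a.finals for s, a in zip(states, automata))

if __name__ == "__main__":
    print(accepted_by_all(AUTOMATA, "try+s", "tries"))
    print(accepted_by_all(AUTOMATA, "try", "tri"))
```

The list `states` is in effect one state of the product machine, which is why compiling the set into a single automaton is possible in principle and also why the compiled machine can become large.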

2.2.  The  Dictionary  Component 

The dictionary component of the KIMMO system is divided into sections called lexicons,
which are all ultimately reachable from a distinguished root lexicon. In the dictionary-level
processing for words such as singing, KIMMO first locates the lexical form sing in the root
lexicon. The mechanism for indicating co-occurrence restrictions involves listing a set of
continuation lexicons for each entry, and in this case one possibility will be a lexicon that contains
+ing. In the actual operation of the KIMMO system, dictionary processing is efficiently
interleaved with the operation of the automata in such a way that the two components mutually
constrain their operations.
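A minimal sketch of the continuation-lexicon idea, in Python, may help fix the picture. The lexicon names `Root` and `Verb`, the end marker `#`, and the function `lexical_forms` are assumptions of this sketch rather than part of the KIMMO dictionary format; the sketch simply enumerates the lexical forms that a small set of lexicons licenses.

```python
# Hypothetical continuation-lexicon structure: each entry lists the lexicons
# that may follow it. "#" marks that the word is allowed to end here.
END = "#"

LEXICONS = {
    "Root": [("sing", ["Verb", END]), ("duckling", [END])],
    "Verb": [("+ing", [END]), ("+s", [END])],
}

def lexical_forms(name="Root", prefix=""):
    """Yield every lexical string reachable from lexicon `name`."""
    for form, continuations in LEXICONS[name]:
        for nxt in continuations:
            if nxt == END:
                yield prefix + form
            else:
                yield from lexical_forms(nxt, prefix + form)

if __name__ == "__main__":
    print(sorted(lexical_forms()))
```

Because duckling is listed as a root of its own and no entry duckl continues into the Verb lexicon, the spurious analysis of duckling as an -ing form is never generated.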

The continuation-class mechanism that the KIMMO dictionary uses to encode co-occurrence
restrictions among roots and affixes has only finite-state power; each lexicon corresponds to a
state in a transition network. As many people have noticed (e.g. Karttunen, 1983:180;
Karttunen and Wittenburg, 1983:222f), such a design makes it difficult or impossible to express
some morphological constraints. In the future, the KIMMO dictionary component will almost
certainly be redesigned.

¹⁰ (continued) would imply the existence of representation levels at the interfaces between automata. Beyond the
question of computational efficiency, the theoretical claims of the two-level model will not be evaluated here.
Possible arguments against them might involve (a) rule orderings […], (b) particular analyses in
which the availability of only two levels leads to redundancy in the automata, and (c) multi-part alternative
representations (e.g. from […] theory) that allow a more illuminating […] of various
processes. One possible argument for them could involve the multiplicity of possibilities for rule ordering in a
model with intermediate derivational steps.

The automaton component rather than the dictionary component of the KIMMO system is
the main object of attention here, and little more will be said about the dictionary component
until section 7.1.

2.3.  Generation  and  Recognition 

A KIMMO system does not particularly lean toward either generating or recognizing the
words of a language. Since the machines of the automaton component just express constraints
on permissible lexical-surface correspondences, they can serve equally well to determine the
lexical form of a surface word (recognition) or to map a lexical stem with affixes into the
proper surface form (generation). The only major difference is whether the process is driven
by the surface or lexical form. However, the recognition algorithm is slightly more complicated
because it uses the lexicon as well as the automata to constrain the analysis of an input word.
(As Karttunen (1983:184) notes, it would require only a simple change to run the recognizer
without the constraints of the stem lexicon. Such a mode of operation would be useful for
stripping recognizable suffixes from unfamiliar roots.)
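The symmetry between generation and recognition can be illustrated by driving one and the same constraint from either side. The Python sketch below assumes the five-state Y-Change table of section 2.1.1 and a deliberately tiny set of feasible pairs (identity pairs plus y/i and +/e); the helper names `accepts`, `generate`, and `recognize`, and the brute-force enumeration of candidates, are assumptions made for illustration and do not reflect how KIMMO actually searches.

```python
from itertools import product

# Assumed Y-Change table: columns are (lexical, surface) pairs, "=" is a wildcard.
COLUMNS = [("y", "i"), ("y", "y"), ("+", "="), ("s", "s"), ("=", "=")]
TABLE = {1: [2, 4, 1, 1, 1], 2: [0, 0, 3, 0, 0], 3: [0, 0, 0, 1, 0],
         4: [2, 4, 5, 1, 1], 5: [2, 4, 1, 0, 1]}
FINAL = {1, 4, 5}

def accepts(lexical, surface):
    """True if the Y-Change automaton accepts the aligned pair of strings."""
    state = 1
    for lex, surf in zip(lexical, surface):
        matching = [i for i, (cl, cs) in enumerate(COLUMNS)
                    if cl in ("=", lex) and cs in ("=", surf)]
        # Assumed tie-break: prefer the column with the fewest wildcards.
        col = max(matching, key=lambda i: sum(c != "=" for c in COLUMNS[i]))
        state = TABLE[state][col]
        if state == 0:
            return False
    return state in FINAL

def surface_options(lex):
    # Hypothetical feasible pairs: y may surface as y or i, + as e.
    return {"y": "yi", "+": "e"}.get(lex, lex)

def lexical_options(surf):
    # The same pairs read in the other direction.
    return {"i": "iy", "e": "e+"}.get(surf, surf)

def generate(lexical):
    """Driven by the lexical form: keep the surface strings the automaton accepts."""
    candidates = product(*(surface_options(c) for c in lexical))
    return ["".join(s) for s in candidates if accepts(lexical, s)]

def recognize(surface, lexicon=None):
    """Driven by the surface form; optionally filtered by a set of lexical entries."""
    candidates = ("".join(l) for l in product(*(lexical_options(c) for c in surface)))
    return [l for l in candidates
            if accepts(l, surface) and (lexicon is None or l in lexicon)]

if __name__ == "__main__":
    print(generate("try+s"))
    print(recognize("tries"))
    print(recognize("tries", {"try+s"}))
```

Run without a lexicon, the recognizer in this sketch returns several lexical strings for tries, since the automaton alone cannot tell which of them are words; supplying the lexicon narrows the result, which is the extra work that makes recognition slightly more complicated than generation.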


3. The Seeds of Complexity


The use of finite-state machinery gives the two-level model the appearance of
computational efficiency, but in the worst case a KIMMO generator or recognizer has a lot of work
to do. This section probes possible sources of complexity, while the next section will exploit
them in mathematical reductions that answer the question of how hard KIMMO generation
and recognition can be in the general case.


3.1.  The  Lure  of  the  Finite-State 

At first glance, the KIMMO system raises hopes of unfailing efficiency. Both recognition
and generation seem to be a matter of stepping finite-state machines through the input from
left to right, a process that takes only a quick array reference or so per character. Any
nondeterminism that might arise causes little initial concern, since methods of determinizing
finite-state machines are well-known. Lexical lookup can also be done quickly, character by
character, interleaved with the speedy left-to-right progress of the automata:

    It is a common technique to represent lexicons as letter trees because it minimizes
    the time spent on searching for the right entry. The recognizer only makes a single
    left-to-right pass as it homes in on its target in the lexicon. (Karttunen, 1983:178)
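A letter tree of the kind mentioned in the quotation can be sketched with nested dictionaries. The Python below is a generic trie, not Karttunen's data structure; the function names `build_letter_tree` and `entries_on_path` and the end-of-entry key `$` are assumptions of the sketch.

```python
# Generic letter tree (trie) built from nested dictionaries.
END = "$"  # hypothetical marker stored at nodes where an entry ends

def build_letter_tree(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # descend, creating nodes as needed
        node[END] = True
    return root

def entries_on_path(tree, text):
    """Single left-to-right pass: list the entries that are prefixes of `text`."""
    found, node = [], tree
    for i, ch in enumerate(text):
        node = node.get(ch)
        if node is None:
            break  # no entry continues with this character
        if END in node:
            found.append(text[: i + 1])
    return found

if __name__ == "__main__":
    tree = build_letter_tree(["spy", "spiel", "spider", "sing"])
    print(entries_on_path(tree, "singing"))
    print(entries_on_path(tree, "spiel"))
```

Each character costs one dictionary lookup, which is the source of the expectation that lexical lookup itself will be fast; the sketch says nothing about the cost of choosing among alternative lexical characters, which is where the search described later arises.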

The fundamental efficiency of finite-state machines promises to make the speed of KIMMO
processing for a language largely independent of the nature of the constraints that the automata
encode:

    The most important technical feature of Koskenniemi’s and our implementation of
    the Two-level model is that morphological rules are represented in the processor as
    automata, more specifically, as finite state transducers .... One important
    consequence of compiling [the grammar rules into automata] is that the complexity of the
    linguistic description of a language has no significant effect on the speed at which
    the forms of that language can be recognized or generated. This is due to the fact
    that finite state machines are very fast to operate because of their simplicity ....
    Although Finnish, for example, is morphologically a much more complicated language
    than English, there is no difference of the same magnitude in the processing times
    for the two languages .... This fact has some psycholinguistic interest because of
    the common sense observation that we talk about “simple” and “complex” languages
    but not about “fast” and “slow” ones. (:166f)

In order for the automaton-based two-level model to be of psycholinguistic interest in this
way, it must be the model itself that wipes out processing difficulty, rather than some
accidental property of the constraints that the automata encode. In much the same vein,
Lindstedt (1984:171) remarks following Koskenniemi that “it is psycholinguistically interesting to
note that the [two-level] rules are equivalent to such computationally simple and effective [i.e.
efficient] devices,” again picking out the finite-state machinery as the factor responsible for
computational efficiency.

3.2. Sample Recognizer Behavior


In assessing the computational characteristics of the KIMMO processing algorithms, it is
logical to begin with an example. Figure 2 shows the operations that a KIMMO recognizer
for English goes through when it analyzes the word spiel. From inspecting the sequence of
lexical forms that are considered, it is clear that the recognizer does more than just gliding
from left to right through the string.

For example, at step 7 the recognizer is considering the lexical string spy+, y surfacing as
i and + as e, under the theory that the input word might be a plural form of the noun spy
(spies or spies’, that is). At step 9 that analysis has failed to pan out and spy+ is considered
again, this time with + coming out null on the surface instead of matching the input e. At
step 11 the recognizer has dropped back to the form spy that it was considering at step 4, this
time taking the root as a verb. All of the spy possibilities ultimately fail, and at step 52 the
recognizer finally tries spi instead, repudiating the incorrect choice that it made in step 3. In
step 53 it assumes that the e in the lexical form spie... might have been deleted, but this
idea soon founders. Finally, in step 59 it finds the correct lexical entry spiel.

3.3.  Sources  of  Runtime  Complexity 

Traces of recognizer operation reveal several factors that combine to determine the overall
computational difficulty of an analysis. The recognizer must run the finite-state machines of
the automaton component and descend the letter trees that make up a lexicon, it must decide
which suffix lexicon to explore after finding a root, and it must discover the correct
lexical-surface correspondence.

3.3.1.  Stepping  through  the  automata  and  the  lexicon 

First of all, some of the recognizer’s activities are concerned with the mechanical operation
of the automata and the letter trees of the lexicon. Running the automata is expected to
be fast; there are many well-known fast implementations of finite-state machines, differing
somewhat in their time and space requirements. Descending a letter tree should also be easy,
in any of its common implementations.

3.3.2.  Choosing  among  alternative  lexicons 

Second, the recognizer often makes unfortunate choices about the path that it should
follow through the collection of lexicons in the dictionary component. Quite a few nodes in
the search tree of Figure 2 represent choices among alternative lexicons (LLL). For example,
at step 11 the recognizer may search any of several lexicons next: the lexicon I that encodes
the fact that the present indicative of a verb may have no added ending, the lexicon AG that
contains the agentive ending +er, or one of several other lexicons that contain +ed and other
inflectional endings.

The search for a path through the suffix lexicons of the dictionary component can take
considerable time in the current KIMMO implementation. However, such wandering can be

Recognizing surface form "spiel".

  1         s          1,4,1,2,1,1
  2         sp         1,1,1,2,1,1
  3         spy        1,3,4,3,1,1
  4         "spy" ends, new lexicon N
  5         "0" ends, new lexicon C1
  6         spy        XXX extra input
  7  (5)    spy+       1,5,16,4,1,1
  8         spy+       XXX
  9  (5)    spy+       1,6,1,4,1,1
 10         spy+       XXX
 11  (4)    "spy" ends, new lexicon I
 12         spy        XXX extra input
 13  (4)    "spy" ends, new lexicon P3
 14         spy+       1,6,1,4,1,1
 15         spy+       XXX
 16  (14)   spy+       1,5,16,4,1,1
 17         spy+       XXX
 18  (4)    "spy" ends, new lexicon PS
 19         spy+       1,6,1,4,1,1
 20         spy+e      1,1,1,1,4,1
 21         spy+e      XXX
 22  (20)   spy+e      1,1,4,1,3,1
 23         spy+e      XXX
 24  (19)   spy+       1,5,16,4,1,1
 25         spy+e      XXX Epenthesis
 26  (4)    "spy" ends, new lexicon PP
 27         spy+       1,6,1,4,1,1
 28         spy+e      1,1,1,1,4,1
 29         spy+e      XXX
 30  (28)   spy+e      1,1,4,1,3,1
 31         spy+e      XXX
 32  (27)   spy+       1,5,16,4,1,1
 33         spy+e      XXX Epenthesis
 34  (4)    "spy" ends, new lexicon PR
 35         spy+       1,6,1,4,1,1
 36         spy+       XXX
 37  (35)   spy+       1,5,16,4,1,1
 38         spy+       XXX
 39  (4)    "spy" ends, new lexicon AG
 40         spy+       1,6,1,4,1,1
 41         spy+e      1,1,1,1,4,1
 42         spy+e      XXX
 43  (41)   spy+e      1,1,4,1,3,1
 44         spy+e      XXX
 45  (40)   spy+       1,5,16,4,1,1
 46         spy+e      XXX Epenthesis
 47  (4)    "spy" ends, new lexicon AB
 48         spy+       1,6,1,4,1,1
 49         spy+       XXX
 50  (48)   spy+       1,5,16,4,1,1
 51         spy+       XXX
 52  (3)    spi        1,1,4,1,2,6
 53         spie       1,1,16,1,6,1
 54         spie       XXX
 55  (53)   spie       1,1,16,1,5,6
 56         spiel      1,1,16,2,1,1
 57         "spiel" ends, new lexicon N
 58         "0" ends, new lexicon C1
 59         "spiel"    *** result
 60  (58)   spiel+     1,1,16,1,1,1
 61         spiel+     XXX

(("spiel" (N SG)))

[Search tree diagram for the analysis of spiel.]

Key to tree nodes:

---  normal traversal
LLL  new lexicon
AAA  blocking by automata
XXX  no lexical-surface pairs compatible with surface char and dictionary
III  blocking by leftover input
***  analysis found


Figure 2: These traces show the steps that the KIMMO recognizer for English goes through while
analyzing the surface form spiel. Each line of the table on the left shows the lexical string and
automaton states at the end of a step. If some automaton blocked, the automaton states are replaced
by an XXX entry. An XXX entry with no automaton name indicates that the lexical string could not
be extended because the surface character and lexical letter tree together ruled out all feasible pairs.
After an XXX or *** entry, the recognizer backtracks and picks up from a previous choice point,
indicated by the parenthesized step number before the lexical string. The tree on the right depicts
the search graphically, reading from left to right and top to bottom with vertical bars linking the
choices at each choice point. The figures were generated with a KIMMO implementation written in an
augmented version of MACLISP based initially on Karttunen's (1983:182ff) algorithm description; the
dictionary and automaton components for English were taken from Karttunen and Wittenburg (1983)
with minor changes. This implementation searches depth-first as Karttunen's does, but explores the
alternatives at a given depth in a different order from Karttunen's.
Recognizing surface form "spiel".

  1         s          1,4,1,2,1,1
  2         sp         1,1,1,2,1,1
  3         spy        1,3,4,3,1,1
  4         "spy" ends, new lexicon (/N)
  5         "0" ends, new lexicon (C1)
  6         spy        XXX extra input
  7  (5)    spy+       1,5,16,4,1,1
  8         spy+       XXX
  9  (5)    spy+       1,6,1,4,1,1
 10         spy+       XXX
 11  (4)    "spy" ends, new lexicon (/V)
 12         spy        XXX extra input
 13   +     spy+       1,6,1,4,1,1
 14         spy+e      1,1,1,1,4,1
 15         spy+e      XXX
 16  (14)   spy+e      1,1,4,1,3,1
 17         spy+e      XXX
 18  (12)   spy+       1,5,16,4,1,1
 19         spy+e      XXX Epenthesis
 20  (3)    spi        1,1,4,1,2,6
 21         spie       1,1,16,1,6,1
 22         spie       XXX
 23  (21)   spie       1,1,16,1,5,6
 24         spiel      1,1,16,2,1,1
 25         "spiel" ends, new lexicon (/N)
 26         "0" ends, new lexicon (C1)
 27         "spiel"    *** result
 28  (26)   spiel+     1,1,16,1,1,1
 29         spiel+     XXX

(("spiel" (N SG)))

[Search tree diagram for the analysis of spiel with the merged dictionary.]

Figure 3: The dictionary modification that will be described in section 7.1 causes the KIMMO
recognizer to make fewer choices among lexicons. These traces show the steps that the recognizer goes
through in the analysis of spiel when the merged dictionary is used; the number of lexicon-choice
nodes (LLL) is lower than in Figure 2. The names of the merged lexicons are written in parenthesized
form to indicate that each one actually represents a class of lexicons in the original dictionary
description. A + entry in the backtracking column indicates backtracking from an immediate failure
in the previous step, which does not require the full backtracking mechanism to be invoked.


sharply reduced by merging the lexicons in such a way that several lexicons can be searched
in parallel; section 7.1 will explain in detail. Meanwhile, taking this improvement for granted
will make it possible to sidestep the problem and focus on other processes. With the merged
dictionary, Figure 3 shows that the number of lexicon choice alternatives in the search tree for
spiel is reduced from 8 to 2,¹¹ cutting the total number of steps from 61 to 29. (The choice
between spy-noun and spy-verb remains because it would be directly reflected in the output,
but the purely internal choices among the lexicons for different verbal endings are eliminated.)

3.3.3.  Finding  the  lexical-surface  correspondence 

Finally, some of the backtracking results from local ambiguity in the construction of the
lexical-surface correspondence. Even if only one possibility is globally compatible with the
constraints imposed by the lexicon and the automata, there may not be enough evidence at
every point in processing to choose the correct lexical-surface pair; search behavior results.

¹¹ These figures count LLL nodes excluding unambiguous choices.


[Search tree diagram for spiel with the merged dictionary; branches are labeled with lexical/surface
pairs such as s/s, p/p, y/i, +/0, +/e, e/0, e/e and with the merged lexicons (/N), (/V), (C1).]

(("spiel" (N SG)))

Figure  4:  This  expanded  version  of  the  search  tree  from  Figure  3  shows  what  hypothesis  the  KiMMO 
recognizer  is  entertaining  along  each  path,  during  the  analysis  of  spiel  with  a  merged  dictionary. 


Figure 4 displays the search graphically with an expanded version of the merged-lexicon search
tree from Figure 3, annotated with information about the specific choices the recognizer has
at each point.

Thus, after seeing the surface characters spi..., the recognizer did not have enough
evidence to choose between the lexical possibilities spy... and spi..., even though only
one analysis was possible for the complete input spiel. During exploration of the spy...
possibility in the (/V) lexicon, there was uncertainty about the pairs +/0, +/e, e/0, and
e/e. It proved unprofitable to explore those regions of the tree in the analysis of spiel, but
Figures 5 and 6 show that the correct analysis can lie in those regions for other words.

Similarly, in analyzing the word rubbish (Figure 7), the recognizer cannot tell after
seeing only rubb... whether the lexical string is rubb... as in rubbish or rub+... as in
rub+ing ==> rubbing. In fact, it briefly considers the possibility that surface r... might
correspond to lexical re'... as in the stress-marked lexical representation re'fer, but it
quickly discovers that the right context for licensing the e/0 pair is absent. (Recall from
section 2.1.1 how a KIMMO automaton implements a change that depends on right context:
initially it permits the changed pair in the expectation that the proper right context will be
found, and upon processing the changed pair, it enters a state-sequence that will eventually
block without the necessary right context.)

In these cases, misguided search subtrees did not get very deep, largely because the
relevant spelling-change processes were local in character. Long-distance harmony processes
are also possible (§1.2), and thus there can potentially be a long interval before the acceptability
of a lexical-surface pair is ultimately determined. For example, when vowel alternations within
a verb stem are conditioned by the occurrence of particular tense suffixes, it may be necessary

[Search tree diagram for spied, with branches labeled by lexical/surface pairs and merged lexicons
as in Figure 4.]

(("spy+ed" (V PAST PRT)) ("spy+ed" (V PAST)))

Figure  5:  The  search  tree  for  spied  is  similar  to  the  search  tree  for  spiel  (Figure  4),  but  the  solution 
lies  in  a  different  region  of  the  tree.  Neither  part  of  the  search  can  be  eliminated,  since  either  one  may 
contain  the  solution. 


[Search tree diagram for spies, with branches labeled by lexical/surface pairs and merged lexicons
as in Figure 4.]

(("spy+s" (V PRES SG 3RD)) ("spy+s" (N PL)))

Figure 6: In the analysis of spies, the location of the solution in the search tree is different from its
location for spiel (Figure 4) or spied (Figure 5). Thus none of the three main regions of the tree
can be pruned from the search.


Recognizing surface form "rubbish".

  1         r          1,1,1,2,1,1
  2         re         1,1,1,1,4,1
  3         re'        XXX Elision
  4  (2)    ru         1,1,4,1,2,1
  5         rub        1,1,6,2,1,1
  6         "rub" ends, new lexicon (/V)
  7         rub        XXX extra input
  8   +     rub+       1,1,3,1,1,1
  9         rub+e      XXX Gemination
 10  (7)    rub+       1,1,2,1,1,1
 11         rub+e      XXX Gemination
 12   +     rub+i      1,1,1,1,2,6
 13         rub+i      XXX
 14  (6)    rubb       1,1,16,2,1,1
 15         rubbi      1,1,16,1,2,6
 16         rubbis     1,4,16,2,1,1
 17         rubbish    1,3,16,2,1,1
 18         "rubbish" ends, new lexicon (/N)
 19         "0" ends, new lexicon (C1)
 20         "rubbish"  *** result
 21  (19)   rubbish+   1,6,16,1,1,1
 22         rubbish+   XXX

(("rubbish" (N SG)))

Figure 7: While analyzing the surface form rubbish, the KIMMO recognizer is temporarily misled
(i) by the possibility that a lexical e' might have been deleted at the surface and (ii) by the possibility
that the surface bb might have resulted from doubling of a single underlying b. However, in each case
the possibility fails to pan out. (Refer to Figure 2 for an explanation of the table format.)


to see the end of the word before making final decisions about the stem.¹² The possibility of
a long period of uncertainty forms the basis for the reductions in section 4.

3.4.  Search  and  Verification 

Setting aside until section 7.1 the problem of choosing among alternative lexicons, it is
easy to see that the use of finite-state machinery helps control only one of the two remaining
sources of complexity. Stepping the automata should be fast, but the finite-state framework
does not guarantee speed in the task of guessing the correct lexical-surface correspondence.
The search required to find the correspondence may predominate.

In fact, the KIMMO recognition and generation problems bear an ominous resemblance
to problems in the computational class NP. NP consists of the problems that can be solved
on a Nondeterministic Turing machine within Polynomial time. Informally, a problem in NP
has a solution that may be hard to guess (hence the use of nondeterministic machines) but is
easy to verify (in polynomial time):

[Informally,] we view [a nondeterministic algorithm] as being composed of two separate
stages, the first being a guessing stage and the second a checking stage ....
(Garey and Johnson, 1979:28)

It should be evident that a "polynomial time nondeterministic algorithm" is basically
a definitional device for capturing the notion of polynomial time verifiability, rather
than a realistic method for solving decision problems. (:29)

This difference in difficulty between guessing and verification seems to fit the KIMMO
framework: the finite-state two-level automata can verify a solution quickly, but it may still be hard
to guess the correct lexical-surface correspondence.

¹² Since long-distance right context is part of the problem, it has been suggested that KIMMO processing in
the problematic cases would be easier if carried out from right to left. However, the more common left context
would then cause difficulties, and what could be done about mixed rules in which both left and right
context play a role? In fact, the reductions in section 4 show that no simple fix will help in the general case.


It is not always apparent from local evidence how to construct a lexical-surface
correspondence that will satisfy the constraints imposed by a set of two-level automata; thus the
KIMMO algorithms contain the seeds of complexity. The next sections will exploit those seeds
in mathematical reductions that prove KIMMO recognition and generation are computationally
difficult in the worst case. The finite-state two-level framework itself does not guarantee
computational efficiency.


4.  The  Complexity  of  Two-Level  Morphology 


The reductions in this section show that two-level automata can describe computationally
difficult problems in a very natural way. It follows that the two-level framework itself cannot
guarantee computational efficiency. If the words of natural languages are easy to analyze,
the efficiency of processing must result from some additional property that natural languages
have, beyond those that are captured in the two-level model. Otherwise, computationally
difficult problems might turn up in the two-level automata for some natural language, just as
they do in the artificially constructed languages here. In fact, the reductions are abstractly
modeled on the KIMMO treatment of harmony processes and other long-distance dependencies
in natural languages (see §§3.3.3, 1.2).

4.1.  The  SAT  Problem 

The  reductions  involve  versions  of  the  Boolean  satisfiability  problem  (SAT).  An  instance 
of  SAT  consists  of  a  Boolean  formula  in  conjunctive  normal  form  (CNF),  and  the  question  to 
be  answered  is  whether  there  is  a  way  of  assigning  values  (T,F)  to  the  variables  so  that  the 
formula  comes  out  true.  Thus  the  formulas 

X 

(x  V  y)&(x  V  y) 

(x̄ V y)&(ȳ V z)&(ȳ V z̄)&(x V y V z)
are  satisfiable,  while  the  formulas 

x & x̄

{x  V  y)&(x  V  y)kx 

(x V y V z)&(x̄ V z̄)&(x̄ V z)&(ȳ V z̄)&(ȳ V z)&(z̄ V y)

are unsatisfiable. The SAT problem is NP-complete and thus computationally difficult. The
related problem 3SAT is a restricted case of SAT in which every disjunction must have exactly
three disjuncts. (This restricted form of CNF is known as 3CNF.) 3SAT is also NP-complete,
though 2SAT is not.¹⁴
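The gap between checking and searching can be made concrete with a small sketch. The code below is only an illustration under assumed conventions: a literal is a hypothetical `(variable, polarity)` pair, and the two longer formulas are the four-clause and six-clause examples whose encoded forms are traced in Figures 11 and 12. Verifying one assignment takes a single pass over the formula, while the naive search may try all 2^n assignments.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

Literal = Tuple[str, bool]   # (variable name, True if positive / False if negated)
CNF = List[List[Literal]]    # conjunction of disjunctions


def verify(cnf: CNF, assignment: Dict[str, bool]) -> bool:
    """Check one candidate assignment: time linear in the size of the formula."""
    return all(any(assignment[v] == positive for v, positive in clause) for clause in cnf)


def satisfying_assignments(cnf: CNF) -> Iterator[Dict[str, bool]]:
    """Brute-force search: up to 2**n candidates for n variables."""
    variables = sorted({v for clause in cnf for v, _ in clause})
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if verify(cnf, assignment):
            yield assignment


contradiction: CNF = [[("x", True)], [("x", False)]]
four_clause: CNF = [
    [("x", False), ("y", True)],
    [("y", False), ("z", True)],
    [("y", False), ("z", False)],
    [("x", True), ("y", True), ("z", True)],
]
six_clause: CNF = [
    [("x", True), ("y", True), ("z", True)],
    [("x", False), ("z", False)],
    [("x", False), ("z", True)],
    [("y", False), ("z", False)],
    [("y", False), ("z", True)],
    [("z", False), ("y", True)],
]
```

Nothing here is specific to KIMMO; it only mirrors the guess-and-check view of NP quoted in section 3.4.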

4.2.  KIMMO Generation is NP-Hard

It is easy to encode an arbitrary SAT problem as a KIMMO generation problem. The
general problem of mapping from lexical to surface forms in KIMMO systems is therefore
NP-hard, i.e. NP-complete or worse (see section 6). Formally, define a possible instance of the
computational problem KIMMO GENERATION as any pair ⟨A, σ⟩, where A is the automaton
component of a KIMMO system specified as in Gajek et al. (1983) and σ is a string over the
alphabet of the KIMMO system. An actual instance of KIMMO GENERATION will be any

¹³ For more extensive theoretical discussions of efficient processability, see Berwick and Weinberg (1982),
Barton (1985a), and references cited therein.

¹⁴ SAT was the first problem to be proved NP-complete (Cook's Theorem, 1971). The NP-completeness of
3SAT is also well-known. For details, see Garey and Johnson (1979) or any standard textbook.


"x-consistency"  3  3

      x  x  =     (lexical characters)
      T  F  =     (surface characters)
  1:  2  3  1     (x undecided)
  2:  2  0  2     (x true)
  3:  0  3  3     (x false)

Figure 8: The KIMMO generator system that encodes a SAT formula φ should include a consistency
automaton of this form for every variable x that occurs in φ. The consistency automaton constrains
the mapping from variables in the lexical string to truth-values in the surface string, ensuring that
whatever value is assigned to x in one occurrence must be assigned to x in every occurrence.


"satisfaction"  3  4 


      =  =  -  ,     (lexical characters)
      T  F  -  ,     (surface characters)
  1.  2  1  3  0     (no true seen in this group)
  2:  2  2  2  1     (true seen in this group)
  3.  1  2  0  0     (-F counts as true)


Figure 9: The SAT generator system for any formula should include this satisfaction automaton, which
determines whether the truth values assigned to the variables cause the formula to come out true.
Since the formula is in CNF, the requirement is that the groups between commas must all contain
at least one true value. In state 1, no true value has been seen; F cycles, while T goes to state 2 to
wait for the comma that begins the next group. State 3 remembers a preceding minus sign so that
-F can count as true. Only state 2 is a final state because only state 2 indicates that a true value has
occurred.


possible instance ⟨A, σ⟩ such that for some σ', the lexical-surface pair σ/σ' satisfies the
constraints imposed by the automata in A. Thus ⟨A, σ⟩ is an instance of KIMMO GENERATION
if there is any surface string that can be generated from the lexical string σ according to the
automata. (As the problem is defined, an algorithm is not required to exhibit the surface
strings that can be generated, but only to say whether there are any.)

To encode a SAT problem φ as a pair ⟨A, σ⟩, first construct σ from the CNF formula
φ by a notational translation. Use a minus sign for negation, a comma for conjunction,
and no explicit operator for disjunction. Then the σ corresponding to the formula
(x̄ V y)&(ȳ V z)&(x V y V z) is -xy,-yz,xyz. The notation is unambiguous without
parentheses because φ is required to be in CNF.
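The notational translation is mechanical, as the following sketch suggests. The function name `encode_cnf` and the `(variable, polarity)` representation of literals are assumptions made for illustration; only the output notation (minus sign, comma, juxtaposition) comes from the text.

```python
from typing import List, Tuple

Literal = Tuple[str, bool]   # (variable name, True if positive / False if negated)
CNF = List[List[Literal]]    # conjunction of disjunctions


def encode_cnf(cnf: CNF) -> str:
    """Write a CNF formula as a lexical string: '-' negates, ',' conjoins,
    and disjuncts within a clause are simply juxtaposed."""
    return ",".join(
        "".join(("" if positive else "-") + var for var, positive in clause)
        for clause in cnf
    )


example: CNF = [
    [("x", False), ("y", True)],
    [("y", False), ("z", True)],
    [("x", True), ("y", True), ("z", True)],
]
```

The encoding runs in time linear in the length of the formula, which is one of the ingredients needed for a polynomial-time reduction.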

Second, construct A (in polynomial time) in three parts. (A varies from formula to formula
only when the formulas involve different sets of variables.) The alphabet specification should list
the variables in σ together with the special characters T, F, minus sign, and comma. The equals
sign should be declared as the KIMMO wildcard character, as usual. The consistency automata,
one for each variable in σ, should be constructed as in Figure 8. The satisfaction automaton
should be copied from Figure 9 and does not vary from formula to formula. Figure 10 lists
the entire SAT generator system A for formulas φ that use variables x, y, and z.

The generator system used in this construction is set up so that surface strings are identical


"x-consistency"  3  3
      x  x  =
      T  F  =
  1:  2  3  1
  2:  2  0  2
  3:  0  3  3

"y-consistency"  3  3
      y  y  =
      T  F  =
  1:  2  3  1
  2:  2  0  2
  3:  0  3  3

"z-consistency"  3  3
      z  z  =
      T  F  =
  1:  2  3  1
  2:  2  0  2
  3:  0  3  3

"satisfaction"  3  4
      =  =  -  ,
      T  F  -  ,
  1.  2  1  3  0
  2:  2  2  2  1
  3.  1  2  0  0


to lexical strings, but with truth values substituted for the variables. Thus any surface string
generated from σ will directly exhibit a satisfying truth-assignment for φ. The consistency
automaton for each variable x ensures that the value assigned to x is consistent throughout
the string. In state 1, no truth-value has been assigned and either x/T or x/F is acceptable.
In state 2, x/T has been chosen once and therefore only x/T can be permitted for other
occurrences of x. Similarly, state 3 allows only x/F. All of the states of the x-consistency
automaton ignore punctuation marks and variables other than x. The satisfaction automaton
blocks if any disjunction contains only F and -T after truth-values have been substituted for the
variables; thus the satisfaction automaton will end up in a final state only if the truth-values
that have been assigned satisfy every disjunction and hence φ.

The net result of the constraints imposed by the consistency and satisfaction automata
is that some surface string can be generated from σ just in case the original formula φ has
a satisfying truth-assignment. Furthermore, the pair ⟨A, σ⟩ can be constructed in time
polynomial in the length of φ; thus SAT is polynomial-time reduced to KIMMO GENERATION,
and the general case of KIMMO GENERATION is at least as hard as SAT. Figure 11 traces
the operation of the KIMMO generation algorithm on a satisfiable formula; note that the
generator goes through quite a bit of search even though there turns out to be only one answer.
Figure 12 shows what happens with an unsatisfiable formula.
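The construction can be exercised with a small simulation. The sketch below is an assumed, simplified stand-in for a KIMMO generator, not the implementation used for the figures: it transcribes the state tables of Figures 8 and 9, treats `=` as a wildcard that matches when no more specific column does, assumes variables are single lowercase letters, and searches depth-first with backtracking. The helper names (`automaton`, `step`, `generate`) are hypothetical, and the order in which alternatives are tried need not match the traces.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (lexical character, surface character)


def automaton(columns: List[Pair], rows: Dict[int, Tuple[int, ...]], finals: Set[int]) -> dict:
    """Bundle a state table: rows[state][column index] is the next state (0 blocks)."""
    return {"columns": columns, "rows": rows, "finals": finals}


def consistency(var: str) -> dict:
    # Figure 8: columns var/T, var/F, =/= ; every state is final.
    return automaton([(var, "T"), (var, "F"), ("=", "=")],
                     {1: (2, 3, 1), 2: (2, 0, 2), 3: (0, 3, 3)}, {1, 2, 3})


def satisfaction() -> dict:
    # Figure 9: columns =/T, =/F, -/-, ,/, ; only state 2 is final.
    return automaton([("=", "T"), ("=", "F"), ("-", "-"), (",", ",")],
                     {1: (2, 1, 3, 0), 2: (2, 2, 2, 1), 3: (1, 2, 0, 0)}, {2})


def step(auto: dict, state: int, pair: Pair) -> int:
    """Advance one automaton; specific columns are listed before wildcard ones."""
    lex, surf = pair
    for i, (col_lex, col_surf) in enumerate(auto["columns"]):
        if col_lex in (lex, "=") and col_surf in (surf, "="):
            return auto["rows"][state][i]
    return 0


def generate(lexical: str) -> List[str]:
    """Return every surface string the SAT generator system allows for `lexical`."""
    variables = sorted({c for c in lexical if c.isalpha()})
    automata = [consistency(v) for v in variables] + [satisfaction()]
    results: List[str] = []

    def search(i: int, states: List[int], surface: str) -> None:
        if i == len(lexical):
            if all(s in a["finals"] for a, s in zip(automata, states)):
                results.append(surface)
            return
        ch = lexical[i]
        # A variable may surface as F or T; punctuation surfaces as itself.
        for out in (("F", "T") if ch.isalpha() else (ch,)):
            nxt = [step(a, s, (ch, out)) for a, s in zip(automata, states)]
            if 0 not in nxt:  # backtrack as soon as any automaton blocks
                search(i + 1, nxt, surface + out)

    search(0, [1] * len(automata), "")
    return results
```

Under these assumptions, `generate("-xy,-yz,-y-z,xyz")` yields the single surface string reported in Figure 11, and the lexical form of Figure 12 yields no surface string at all.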


4.3.  KIMMO Recognition is NP-Hard


Like the generator, the KIMMO recognizer can be used to solve computationally difficult
problems. KIMMO recognition and KIMMO generation are both NP-hard. To treat the
recognizer formally, define a possible instance of the computational problem KIMMO
RECOGNITION as any triple ⟨A, D, σ⟩, where A and σ are as before, and D is the dictionary
component of a KIMMO system described as specified in Gajek et al. (1983). An actual instance of
KIMMO RECOGNITION will be any possible instance ⟨A, D, σ⟩ such that for some σ', (i) the
lexical-surface pair σ'/σ satisfies the constraints imposed by the automata in A as before,
and (ii) σ' can be generated by the dictionary component D. Thus ⟨A, D, σ⟩ is an instance of


Figure 10: This is the complete KIMMO generator system for solving SAT problems in the
variables x, y, and z. The system includes a consistency automaton for each variable in
addition to a satisfaction automaton that does not vary from problem to problem.


ALPHABET  x  y  z  T  F  -  ,

ANY  =

END


Generating from lexical form "-xy,-yz,-y-z,xyz"

  1         -                   1,1,1,3
  2         -F                  3,1,1,2
  3         -FF                 3,3,1,2
  4         -FF,                3,3,1,1
  5         -FF,-               3,3,1,3
  6         -FF,-T              XXX y-con.
  7   +     -FF,-F              3,3,1,2
  8         -FF,-FF             3,3,3,2
  9         -FF,-FF,            3,3,3,1
 10         -FF,-FF,-           3,3,3,3
 11         -FF,-FF,-T          XXX y-con.
 12   +     -FF,-FF,-F          3,3,3,2
 13         -FF,-FF,-F-         3,3,3,2
 14         -FF,-FF,-F-T        XXX z-con.
 15   +     -FF,-FF,-F-F        3,3,3,2
 16         -FF,-FF,-F-F,       3,3,3,1
 17         -FF,-FF,-F-F,T      XXX x-con.
 18   +     -FF,-FF,-F-F,F      3,3,3,1
 19         -FF,-FF,-F-F,FT     XXX y-con.
 20   +     -FF,-FF,-F-F,FF     3,3,3,1
 21         -FF,-FF,-F-F,FFT    XXX z-con.
 22   +     -FF,-FF,-F-F,FFF    3,3,3,1
 23         -FF,-FF,-F-F,FFF    XXX satis. nf.
 24  (8)    -FF,-FT             3,3,2,2
 25         -FF,-FT,            3,3,2,1
 26         -FF,-FT,-           3,3,2,3
 27         -FF,-FT,-T          XXX y-con.
 28   +     -FF,-FT,-F          3,3,2,2
 29         -FF,-FT,-F-         3,3,2,2
 30         -FF,-FT,-F-F        XXX z-con.
 31   +     -FF,-FT,-F-T        3,3,2,2
 32         -FF,-FT,-F-T,       3,3,2,1
 33         -FF,-FT,-F-T,T      XXX x-con.
 34   +     -FF,-FT,-F-T,F      3,3,2,1
 35         -FF,-FT,-F-T,FT     XXX y-con.
 36   +     -FF,-FT,-F-T,FF     3,3,2,1
 37         -FF,-FT,-F-T,FFF    XXX z-con.
 38   +     -FF,-FT,-F-T,FFT    3,3,2,2
 39         "-FF,-FT,-F-T,FFT"  *** result
 40  (3)    -FT                 3,2,1,2
 41         -FT,                3,2,1,1
 42         -FT,-               3,2,1,3
 43         -FT,-F              XXX y-con.
 44   +     -FT,-T              3,2,1,1
 45         -FT,-TF             3,2,3,1
 46         -FT,-TF,            XXX satis.
 47  (45)   -FT,-TT             3,2,2,2
 48         -FT,-TT,            3,2,2,1
 49         -FT,-TT,-           3,2,2,3
 50         -FT,-TT,-F          XXX y-con.
 51   +     -FT,-TT,-T          3,2,2,1
 52         -FT,-TT,-T-         3,2,2,3
 53         -FT,-TT,-T-F        XXX z-con.
 54   +     -FT,-TT,-T-T        3,2,2,1
 55         -FT,-TT,-T-T,       XXX satis.
 56  (2)    -T                  2,1,1,1
 57         -TF                 2,3,1,1
 58         -TF,                XXX satis.
 59  (57)   -TT                 2,2,1,2
 60         -TT,                2,2,1,1
 61         -TT,-               2,2,1,3
 62         -TT,-F              XXX y-con.
 63   +     -TT,-T              2,2,1,1
 64         -TT,-TF             2,2,3,1
 65         -TT,-TF,            XXX satis.
 66  (64)   -TT,-TT             2,2,2,2
 67         -TT,-TT,            2,2,2,1
 68         -TT,-TT,-           2,2,2,3
 69         -TT,-TT,-F          XXX y-con.
 70   +     -TT,-TT,-T          2,2,2,1
 71         -TT,-TT,-T-         2,2,2,3
 72         -TT,-TT,-T-F        XXX z-con.
 73   +     -TT,-TT,-T-T        2,2,2,1
 74         -TT,-TT,-T-T,       XXX satis.

("-FF,-FT,-F-T,FFT")

Figure 11: The KIMMO generator system of Figure 10 goes through these steps when applied to the
encoded version of the (satisfiable) formula (x̄ V y)&(ȳ V z)&(ȳ V z̄)&(x V y V z). Though only one
truth-assignment will satisfy the formula, it takes quite a bit of backtracking to find it. The notation
used here for describing generator actions is similar to that used to describe recognizer actions in
Figure 2, but a surface rather than a lexical string is the goal. As in Figure 7, a +-entry in the
backtracking column indicates backtracking from an immediate failure in the preceding step, which
does not require the full backtracking mechanism to be invoked.


KIMMO RECOGNITION if σ is a recognizable word according to the constraints of A and
D.

Many reductions are possible, but the reduction that will be sketched here uses the 3SAT
problem instead of SAT. It also uses an encoding for CNF formulas that is slightly different
from the one used in the generator reduction. To encode a SAT problem φ as a triple ⟨A, D, σ⟩,
first construct σ from φ by a new notational translation. This time, treat a variable x and
its negation x̄ as separate, atomic characters. Continue to use a comma for conjunction and
no explicit operator for disjunction, but now add a period at the end of the formula. Then
the σ corresponding to the formula (x̄ V x̄ V y)&(ȳ V ȳ V z)&(x V y V z) is x̄x̄y,ȳȳz,xyz.,
a string of 12 characters. (With 3SAT, the commas are redundant, but they are retained here
in the interest of readability.)

Second, construct A (in polynomial time) in two parts. (As before, A varies from formula
to formula only when the formulas involve different sets of variables.) The alphabet
specification should list the variables in σ together with their negations and the special characters
T, F, comma, and period. The equals sign should again be declared as the KIMMO wildcard
character. The consistency automata, still one for each variable in σ, should be constructed


[Figure 12 trace not reproduced: a two-column listing of the 140 generator steps for the lexical
form "xyz,-x-z,-xz,-y-z,-yz,-zy.". Each row gives the step number, the surface string of
truth-values built so far, and either the resulting automaton states or a blocking message
(XXX x-con., XXX y-con., XXX z-con., or XXX satis.).]

Figure 12: The KIMMO generator system of Figure 10 goes through 140 steps before verifying that the
formula (x ∨ y ∨ z) & (x̄ ∨ z̄) & (x̄ ∨ z) & (ȳ ∨ z̄) & (ȳ ∨ z) & (z̄ ∨ y) has no satisfying truth-assignment.


"x-consistency" 3 5

      T  T  F  F  =    (lexical characters)
      x  x̄  x  x̄  =    (surface characters)

  1:  2  3  3  2  1    (x undecided)
  2:  2  0  0  2  2    (x true)
  3:  0  3  3  0  3    (x false)

Figure 13: The KIMMO recognizer system that encodes a 3SAT formula φ should include a consistency
automaton of this form for every variable x that occurs in φ. As in the generator reduction, the
consistency automaton constrains the mapping from variables to truth-values, ensuring that the value
assigned to x is consistent throughout the formula. However, in the recognizer reduction the automaton
must also ensure that the values assigned to x and x̄ are opposites, since x and x̄ are treated as atomic
alphabet characters.
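
The behavior of the automaton in Figure 13 can be traced mechanically. The following Python sketch
is only an illustration: the dictionary encoding, the name `run_consistency`, and the spelling `~x`
for the atomic character x̄ are assumptions made here and are not part of any KIMMO implementation.

```python
# Transition table of the x-consistency automaton of Figure 13.
# Keys are (lexical, surface) pairs; "~x" stands for the atomic character
# written as x with an overbar. State 0 means the automaton has blocked.
X_CONSISTENCY = {
    1: {("T", "x"): 2, ("T", "~x"): 3, ("F", "x"): 3, ("F", "~x"): 2},  # x undecided
    2: {("T", "x"): 2, ("T", "~x"): 0, ("F", "x"): 0, ("F", "~x"): 2},  # x true
    3: {("T", "x"): 0, ("T", "~x"): 3, ("F", "x"): 3, ("F", "~x"): 0},  # x false
}


def run_consistency(pairs, table=X_CONSISTENCY):
    """Return the state reached after reading the pairs, or 0 if blocked."""
    state = 1
    for pair in pairs:
        # Any pair not listed falls under the "=" column, which keeps the state.
        state = table[state].get(pair, state)
        if state == 0:
            return 0
    return state
```

For example, the pairs T/x followed by F/x̄ leave the automaton in state 2 (x true), whereas T/x
followed by T/x̄ blocks, because x and x̄ would both be made true.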


ALTERNATIONS
    ( Root = Root )
    ( Punct = Punct )
    ( # = )
END

LEXICON Root     TTT   Punct   ""
                 TTF   Punct   ""
                 TFT   Punct   ""
                 TFF   Punct   ""
                 FTT   Punct   ""
                 FTF   Punct   ""
                 FFT   Punct   ""

LEXICON Punct    ,     Root    ""
                 .     #       ""

Figure 14: The 3SAT recognizer system for any formula should include this dictionary component,
which ensures that the truth-values assigned to the variables in the surface string will cause the
formula to come out true. All combinations of three truth values are listed, except for the value FFF
that would cause one of the 3CNF disjunctions to be false; the same dictionary component is used for
all 3SAT problems. Each lexicon entry specifies the continuation class of lexicons that can follow. For
instance, the class Punct containing only the lexicon Punct is the continuation class of TTT, while the
class of . is the empty continuation class #. "" is an empty feature set, used since no word features
are being recovered in this mathematical reduction. The detailed format of the dictionary component
is described in Gajek et al. (1983).


as  in  Figure  13.  There  is  no  satisfaction  automaton  in  this  version  of  the  recognizer. 

Finally, take D as a constant from Figure 14. In this reduction, D imposes the satisfaction
constraint that was enforced with an automaton in the generator reduction. Formula φ will
be satisfied iff all of its conjuncts are satisfied, and since φ is in 3CNF, that means the
truth-values assigned within each disjunction must be TTT, TTF, ..., or any combination of three
truth-values except FFF. This is exactly the constraint imposed by the dictionary. (Note that
D is the same for every 3SAT problem; it does not grow with the size of the formula or the
number of variables.)

Compared to the generator reduction, the roles of the lexical and surface strings are
reversed in the recognizer reduction. The surface string encodes φ, while the lexical string
indicates truth-values for its variables. The consistency automaton for each variable x still
ensures that the value assigned to x is consistent throughout the formula, but now it also
ensures that x and x̄ are assigned opposite values. As before, the net result of the constraints
imposed by the various components is that ⟨A, D, σ⟩ is in KIMMO RECOGNITION just in
case φ has a satisfying truth-assignment. The general case of KIMMO RECOGNITION is at
least as hard as 3SAT, hence at least as hard as SAT or any other problem in NP (in the
sense of polynomial-time reduction).
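
The recognizer reduction can be summarized in a few lines of Python. This is a sketch under stated
assumptions rather than a KIMMO implementation: clauses are given as tuples of three literals, `~x`
stands for x̄, the comma and period characters are omitted, and the names `consistent`,
`in_dictionary`, and `recognize` are hypothetical. The exhaustive loop stands in for the
recognizer's search over lexical strings.

```python
from itertools import product


def consistent(lexical, literals):
    """Consistency automata: one value per variable; x and ~x get opposite values."""
    value = {}  # variable name -> truth value implied so far
    for truth, lit in zip(lexical, literals):
        var = lit.lstrip("~")
        # T on x or F on ~x makes x true; T on ~x or F on x makes x false.
        implied = (truth == "T") != lit.startswith("~")
        if value.setdefault(var, implied) != implied:
            return False
    return True


def in_dictionary(lexical):
    """Dictionary of Figure 14: any group of three truth-values except FFF."""
    groups = [lexical[i:i + 3] for i in range(0, len(lexical), 3)]
    return all(group != "FFF" for group in groups)


def recognize(clauses):
    """Return a lexical string acceptable for the surface encoding, or None."""
    literals = [lit for clause in clauses for lit in clause]
    for candidate in product("TF", repeat=len(literals)):
        lexical = "".join(candidate)
        if in_dictionary(lexical) and consistent(lexical, literals):
            return lexical
    return None
```

With the clause (x ∨ x̄ ∨ y), `recognize([("x", "~x", "y")])` returns "TFT", while the
unsatisfiable pair of clauses (x ∨ x ∨ x) and (x̄ ∨ x̄ ∨ x̄) yields `None`.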


5.  The  Effect  of  Precompilation 


The reductions presented in section 4 require both the language description and the input
string to vary with the SAT/3SAT problem to be solved. Hence, there arises the question
of whether some computationally intensive form of precompilation could blunt the force of
the reduction, paying a potentially exponential compilation cost once and allowing KIMMO
runtime for a given grammar to be uniformly fast thereafter. This section examines four
aspects of the precompilation question.

5.1.  Conversion to GMACHINE/RMACHINE Form

The external description of a KIMMO automaton or lexicon is not the same as the form
that is used by the generation or recognition algorithm at runtime. Instead, the external
descriptions are used to construct internal forms: RMACHINE and GMACHINE forms for automata,
and letter trees for lexicons (Gajek et al., 1983). Hence one question to address is whether the
complexity implied by the reduction might actually apply to the construction of these internal
forms. If this were true, then the complexity of the generation problem (for instance) would
be concentrated in the construction of the "feasible-pair list" and the GMACHINE.

It is possible to deal with this question directly by reformulating the reduction so that the
formal problems and the construction specify machines in terms of their internal (e.g.
GMACHINE) forms instead of their external descriptions. The GMACHINEs for the class of machines
created in the construction have a very regular structure, and it is easy to build them directly
instead of building descriptions in external format. As Figure 11 also suggested, it is runtime
processing that makes translated SAT problems difficult for a KIMMO system to solve.

5.2.  BIGMACHINE  Precompilation 

There is also another kind of preprocessing that might be expected to help. As
mentioned in section 2.1.2, it is possible to compile a set of KIMMO automata into a single large
automaton that will run faster than the original set. The system will usually run faster with
one large automaton than with several small ones, since it has only one machine to step and
the speed of stepping a machine is largely independent of its size. However, in the worst case
the merged automaton is prohibitively large, exponentially larger than the smaller machines
(Karttunen, 1983:176).

Gajek et al. (1983) use the terms BIGGMACHINE and BIGRMACHINE to refer to the
generation and recognition versions of a large merged automaton, and therefore such an automaton
will be called a BIGMACHINE. Since it can take exponential time to build the BIGMACHINE
for a translated SAT problem, the reduction formally allows the possibility that BIGMACHINE
precompilation could make runtime processing uniformly efficient.

However, an expensive BIGMACHINE precompilation step doesn't help runtime processing
enough to change the fundamental complexity of the algorithms. Recall from section 3.3 that
the main ingredients of KIMMO runtime complexity are the mechanical operation of the
automata, the difficulty of finding the correct lexical/surface correspondence, and the necessity
of choosing among alternative lexicons. BIGMACHINE precompilation will speed up the
mechanical operation of the automata, perhaps by a factor equal to the number of variables in
the SAT query. However, it will not help in the task of deciding which lexical/surface pair will
be globally acceptable. The BIGMACHINE will be as limited as the equivalent automata in its
forecasting abilities. Precompilation oils the machinery, but doesn't accomplish fundamental
redesign.

5.3.  BIGMACHINE  Size  and  the  Interaction  of  Constraints 

BIGMACHINE precompilation sheds light on another precompilation question as well. It
is known that the compiled BIGMACHINE corresponding to a set of KIMMO automata can be
exponentially larger than the original system in the worst case; for example, such blowup
occurs if the SAT automata are compiled into a BIGMACHINE. In practice, however, the size
of the BIGMACHINE varies, thus naturally raising the question of what distinguishes the
"explosive" sets of automata from those that behave more tractably.

It is sometimes suggested that the degree of interaction among constraints determines
the amount of BIGMACHINE blowup. In this view, a large BIGMACHINE for a SAT problem is
no surprise, for the computational difficulty of SAT and similar problems results in part from
their "global" character. Their solutions generally cannot be deduced piece by piece from
local evidence; instead, the acceptability of each part of the solution may depend on the whole
problem. In the worst case, the solution is determined by a complex conspiracy among the
constraints of the problem. Thus the large BIGMACHINE gives a more "honest" estimate of
problem difficulty than the small collection of individual automata.

However, a slight change in the SAT automata demonstrates that BIGMACHINE size need
not correspond to the degree of interaction among the automata. Eliminate the satisfaction
automaton from the generator system, leaving only the consistency automata for the variables.
Then the system will not search for a satisfying truth-assignment, but merely for one that is
internally consistent; that is, one that never assigns both T and F to the same variable in its
different occurrences. This change will entirely eliminate the interactions among the automata;
each automaton is concerned only with the assignments to its particular variable, and there is no
way for an assignment to one variable to influence the acceptability of an assignment to another.

Yet despite the elimination of interactions, the BIGMACHINE must still be exponentially
larger than the collection of individual automata. Since the states of the BIGMACHINE must
distinguish all the possible truth-assignments to the variables, its size must be exponential in
the number of individual automata. In fact, the lack of interactions can actually increase the
number of states in the BIGMACHINE. Interactions among the automata constrain the
combinations of states that can be reached, thus reducing the number of accessible combinations
below the mathematical upper limit.
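
The size argument can be checked on small cases. The Python sketch below is an illustration under
simplifying assumptions: each consistency automaton is reduced to three states (1 = undecided,
2 = true, 3 = false), every input pair affects only the automaton for its own variable, and the
function name `reachable_product_states` is hypothetical.

```python
def reachable_product_states(n):
    """Count reachable state combinations for n independent consistency automata."""
    start = (1,) * n  # every variable undecided
    seen = {start}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        for var in range(n):
            for value in (2, 3):  # 2 = variable true, 3 = variable false
                # The automaton for `var` accepts the pair only if it is
                # undecided or has already committed to the same value.
                if state[var] in (1, value):
                    nxt = state[:var] + (value,) + state[var + 1:]
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append(nxt)
    return len(seen)
```

Under these assumptions the individual automata have 3n states in total, while the merged machine
has 3^n reachable state combinations, since nothing rules any combination out.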

5.4.  Transducers  and  Determinization 

One more precompilation question is whether the nondeterminism involved in constructing
the lexical/surface correspondence can't be removed by standard determinization techniques

Figure 15: This nondeterministic finite-state transducer cannot be determinized. An equivalent
deterministic FST would have to wait for the end of the input string before generating any output.
However, at that point it would have to remember how many as or bs to output in correspondence
with the unbounded number of xs in the string, an impossible task for a finite-state device.


for finite-state machines. After all, every nondeterministic finite-state machine has a
deterministic counterpart that is equivalent in the sense that it accepts the same language.¹⁵ Aren't
KIMMO automata just ordinary finite-state machines operating over an alphabet that happens
to consist of pairs of characters?

It is indeed possible to view KIMMO automata in this way when they are being used to
verify or reject hypothesized pairs of lexical and surface strings.¹⁶ However, in this use they
don't need determinizing: they are already deterministic, for there is only one new state listed
in each cell of the description of a KIMMO automaton. In the cases of primary interest
(generation and recognition) the machines are being used as genuine transducers rather than
acceptors.

The determinizing algorithms that apply to finite-state acceptors will not work on
transducers. Indeed, many finite-state transducers are not determinizable at all. For example,
consider the transducer in Figure 15. On input xxxxxa it must output aaaaaa, while on input
xxxxxb it must output bbbbbb. An equivalent deterministic finite-state transducer is impossible.
A deterministic transducer could not know whether to output a or b upon seeing x. However,
it also could not output nothing and put off the decision until later: being finite-state, it would
not in general be able to remember at the end how many occurrences of x there had been, so
it would not be able to print the right number of initial occurrences of a or b.
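
The behavior described for Figure 15 can be simulated by following all nondeterministic choices in
parallel. The arc table below is a hypothetical reconstruction that matches the input/output
behavior stated in the text; the state names and the function `transduce` are assumptions made
for this sketch, not a transcription of the figure.

```python
# Each entry maps (state, input character) to a list of (output, next state).
ARCS = {
    ("start", "x"): [("a", "A"), ("b", "B")],  # the nondeterministic guess
    ("start", "a"): [("a", "done")],
    ("start", "b"): [("b", "done")],
    ("A", "x"): [("a", "A")],
    ("A", "a"): [("a", "done")],
    ("B", "x"): [("b", "B")],
    ("B", "b"): [("b", "done")],
}


def transduce(word):
    """Return the outputs of all accepting runs on the input word."""
    configs = {("start", "")}  # (current state, output produced so far)
    for ch in word:
        configs = {
            (nxt, out + emitted)
            for state, out in configs
            for emitted, nxt in ARCS.get((state, ch), [])
        }
    return {out for state, out in configs if state == "done"}
```

The simulation carries two candidate outputs until the final character arrives, and the length of
those pending outputs grows with the input. That unbounded bookkeeping is what a deterministic
finite-state transducer cannot do.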

For similar reasons, there is no way to build deterministic finite-state transducers for the
SAT problems. Upon seeing the first occurrence of a variable, a deterministic transducer could
not know in general whether it should output T or F. However, it also could not wait and output
a truth-value later, for there might be an unbounded number of occurrences of the variable

¹⁵But not in the sense that it assigns the same parses to the strings of the language, where a parse according to
a finite-state machine is the sequence of states traversed, a point related to the impossibility of determinizing
transducers.

¹⁶This statement ignores any subtleties having to do with the processing of nulls, which will be discussed
later (§6).


before there was sufficient evidence to assign the truth-value. A finite-state transducer would
not be able in general to remember how many truth-value outputs had been deferred.


6.  The  Effect  of  Nulls 


Since KIMMO systems can encode NP-complete problems, the general KIMMO generation
and recognition problems are at least as hard as the computationally difficult problems in
NP. But could they be even harder? The answer depends on whether null characters are
allowed. If null characters are forbidden, the problems are in NP, hence (given the previous
NP-hardness result) NP-complete (§6.1). If null characters are completely unrestricted, the
problems are PSPACE-complete, thus potentially even harder than the problems in NP (§6.2).
However, the full power of unrestricted null characters is not needed for linguistically relevant
processing. Continuing to explore the effect of KIMMO null characters, section 6.3 mentions a
subtle point, with computational consequences, about the interpretation of the KIMMO
constraint-intersection operation when nulls are involved.

6.1.  NP-Completeness Without Nulls

The generation and recognition problems for KIMMO automata without nulls are
NP-complete. Since section 4 showed that the problems were NP-hard, all that remains is to
show that a nondeterministic machine could solve them in polynomial time. Only a sketch of
the proofs will be given.

Given a possible instance ⟨A, σ⟩ of KIMMO GENERATION, the basic nondeterminism
of the machine can be used to guess the surface string corresponding to the lexical string σ.
The automata can then quickly verify the correspondence. The key fact is that if A allows no
nulls, the lexical and surface characters must be in one-to-one correspondence. The surface
string must be the same length as the lexical string, so the size of the guess can't get out of
hand. (If the guess were too large, the machine would not run in polynomial time.)

Given a possible instance ⟨A, D, σ⟩ of KIMMO RECOGNITION, the machine should
guess the lexical string instead of the surface string; as before, its length will be manageable.
Now, however, the machine must also guess a path through the dictionary. The number of
choice points is limited by the length of the string,¹⁸ while the number of choices at each point
is limited by the number of lexicons in the dictionary. Given a lexical/surface correspondence
and a lexicon path, the automata and the dictionary component can quickly verify that the
lexical/surface string pair satisfies all relevant constraints.
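
The verification step of the membership argument can be sketched as follows. This is a hypothetical
illustration, not the KIMMO algorithm: each automaton is assumed to be a pair of a transition
dictionary keyed by (state, (lexical, surface)) and a set of final states, with state 1 as the start
state, and the dictionary component is left out.

```python
def verify(automata, lexical, surface):
    """Check a guessed pairing in time O(len(lexical) * len(automata))."""
    if len(lexical) != len(surface):  # without nulls the pairing is one-to-one
        return False
    states = [1] * len(automata)  # state 1 is the start state of each automaton
    for pair in zip(lexical, surface):
        for i, (table, _finals) in enumerate(automata):
            states[i] = table.get((states[i], pair), 0)  # 0 = blocked
            if states[i] == 0:
                return False
    # Every automaton must end in one of its final states.
    return all(state in finals for state, (_, finals) in zip(states, automata))
```

The guessed string has the same length as the input, and each character pair is examined once by
each automaton, so the check runs in polynomial time.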

¹⁷When nulls are allowed as in the next section, the machine must also guess where to insert 0 characters into
the surface string. Because of the way the automata operate, the strings that are submitted to the automata
for verification must include the nulls.

¹⁸Nulls in the lexicon do not have the same interpretation as nulls in the automata. Nulls should not occur
in the dictionary, except in "null lexicon entries" that are written as 0 in their entirety. Unlike nulls in the
automaton component, which are treated as genuine characters by the automata, null lexicon entries are merely
a notational device and can be removed in the course of constructing letter trees from the lexicons. Thus the
number of choice points in the lexicon data-structure is limited by the length of the lexical string even when
nulls are permitted.

6.2.  PSPACE-Completeness with Unrestricted Nulls

If nulls are completely unrestricted, the arguments of section 6.1 do not go through. The
problem is that unrestricted null characters allow the lexical and surface strings to differ wildly
in length. The time it takes to guess or verify the lexical/surface correspondence may no longer
be polynomially bounded in the length of the input string.

In fact, it is easy to show that KIMMO RECOGNITION with unrestricted null characters
is PSPACE-complete, that is, at least as hard as any problem that can be solved in polynomial space.
Though the question is open, PSPACE-complete problems are likely to be even harder than
NP-complete problems.

Not only is a PSPACE-complete problem not likely to be in P, it is also not likely to
be in NP. Hence a property whose existence question is PSPACE-complete probably
cannot even be verified in polynomial time using a polynomial length "guess." (Garey
and Johnson, 1979:171).

Thus the worst case of KIMMO RECOGNITION becomes extremely difficult if null
characters are completely unrestricted. (Incidentally, PSPACE includes such problems as deciding
whether a player has a forced win from certain N x N checkers or Go configurations.¹⁹)

The easiest PSPACE-completeness reduction for KIMMO RECOGNITION with
unrestricted nulls involves the computational problem FINITE STATE AUTOMATA
INTERSECTION (Garey and Johnson, 1979:260). A possible instance of FSAI is a set of deterministic
finite-state automata over the same alphabet. The problem is to determine whether there is
any string that is accepted by all of the automata. Given a set of automata over alphabet
Σ, construct a corresponding KIMMO RECOGNITION problem as follows. Let a and b be
new characters not in Σ, and take the KIMMO alphabet to be Σ ∪ {a, b}.²⁰ Declare = as the
wildcard character and 0 as the null character.

Then build the rest of the automaton component in two parts. First, include the following
"main driver" automaton:

"Main Driver" 3 3

      a  b  =    (lexical characters)
      a  b  0    (surface characters)

  1.  2  0  0    (want a)
  2.  0  3  2    (let automata run)
  3:  0  0  0    (got ab; final state)

This will accept the surface string ab, allowing arbitrary lexical gyrations between a and b
as long as they come out null on the surface. Second, for each of the automata in the FSAI
problem, translate it directly into a KIMMO automaton by pairing the original characters from
Σ with surface nulls. Also add columns for a/a and b/b, with entries zero unless otherwise
specified. Bump all of the state numbers up by two. Let the new start state accept only a/a,
¹⁹A few restrictions on the problems are necessary in order to show membership in PSPACE. For details,
see Garey and Johnson (1979:173,256f) and references cited therein.

²⁰The reduction can be done without a and b, but they are included because the resulting reduction is
more reminiscent of ordinary processing problems in which the question arises of how many nulls to hypothesize
between characters.


going to 3 (the old start state). Let only state 2 be a final state, but for every state that was
final in the original automaton, give it a transition to 2 on b/b.

Third, let the root lexicon of the dictionary component contain a lexicon entry for each
single character in Σ ∪ {a, b}. The continuation class of each entry should send it back to the
root lexicon, except that the entry for b should list the word-final continuation class # instead.
Finally, take ab as the surface string for the KIMMO RECOGNITION problem. Surface a
will start up the translated versions of the original automata, which will be able to run freely
in between the a and the b because the characters in Σ all get paired with surface nulls. If
there is some string that all of the original automata accept, that lexical string will send all of
the translated automata into a state where the remaining b is acceptable. On the other hand,
if the original intersection is empty, the b will never become acceptable and the recognizer will
not accept the string ab.
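
The question that the constructed recognition problem poses can be sketched directly in Python.
The representation is an assumption made for illustration: each deterministic automaton is a
dictionary with hypothetical keys "start", "finals", and "delta", and the function names are
invented here. The search explores combinations of automaton states, which can be exponentially
numerous in the worst case.

```python
from collections import deque


def common_string(dfas, alphabet):
    """Breadth-first search of the product automaton for a string all DFAs accept."""
    start = tuple(d["start"] for d in dfas)
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        states, word = queue.popleft()
        if all(s in d["finals"] for s, d in zip(states, dfas)):
            return word
        for ch in alphabet:
            nxt = tuple(d["delta"].get((s, ch)) for s, d in zip(states, dfas))
            if None not in nxt and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + ch))
    return None


def kimmo_analysis(dfas, alphabet):
    """Lexical/surface pairing for the surface string ab, or None if there is none."""
    word = common_string(dfas, alphabet)
    if word is None:
        return None
    # Each original character is paired with a surface null between a and b.
    return ("a" + word + "b", "a" + "0" * len(word) + "b")
```

For two automata over {p, q}, one accepting strings with an odd number of p and one accepting
strings that end in q, the sketch returns the pairing apqb/a00b.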

This construction forms one half of the PSPACE-completeness proof, but it is also
necessary to show that KIMMO RECOGNITION is no harder than problems in PSPACE. It
is sufficient to transform arbitrary KIMMO RECOGNITION problems into FSAI problems.
Given a recognition problem, first convert the dictionary component into a large automaton
that (i) constrains the lexical string in the same way the dictionary component does, pairing
lexical characters with surface wildcards, but (ii) allows nulls to be inserted freely at the
lexical level, in case the other automata permit lexical nulls. The conversion can be performed
because the dictionary component is finite-state. Second, convert the input string into an
automaton as well. The input-string automaton should (i) constrain the surface string to be
exactly the input string, but (ii) allow surface nulls to be inserted freely. Third, expand out
all wildcard and subset characters in the automata, then interpret each lexical/surface pair
at the head of an automaton column as a single character in an extended alphabet. Given
this preparation, it is possible to solve the original recognition problem by solving FSAI for
the augmented set of automata. Since the input string is now encoded as an automaton,
the intersection of the languages accepted by all the automata consists of all the permissible
lexical/surface correspondences that reflect recognition of the input string. The intersection
will be nonempty (as FSAI tests) if and only if the input string is recognizable.

The PSPACE-completeness proof shows that if null characters are completely unrestricted,
it can be very hard for the recognizer to reconstruct the superficially null characters that may
lexically intervene between two surface characters. However, unrestricted nulls surely are not
needed for linguistically relevant KIMMO systems. Processing complexity can be reduced by
any restriction that prevents the number of possible nulls between surface characters from
getting too large. As a crude approximation to a reasonable constraint, the above reduction
could be ruled out by forbidding entire lexicon entries to come out null on the surface.²¹ A
suitable restriction would make the KIMMO generation and recognition problems only
NP-complete rather than PSPACE-complete.

²¹Recall from footnote 18 that an entry "0" in the dictionary is not the same as a dictionary entry that is
entirely deleted at the surface by the automata.


6.3.  The  Intersection  of  Constraints 


The null characters (0) that can appear in a KIMMO automaton allow the recognizer to
advance without consuming any characters from the input word. For example, in analyzing the
word hoed as hoe+ed, the automata advance as if the surface string were ho00ed (see Karttunen
and Wittenburg, 1983:220), postulating surface nulls freely as required by the constraints of
the system. However, the interpretation of 0 as the empty string involves some subtlety when
multiple constraints are involved.

Internal to a KIMMO automaton, 0 is treated the same as any other character, but 0 is
effectively deleted at the interface to the surface string or the dictionary component. Abstractly
speaking, the treatment of nulls by the KIMMO recognizer involves two steps: (i) null characters
are inserted freely into the surface string to produce a form like ho00ed; (ii) this augmented
string is used to run the automata. Thus, a KIMMO automaton can be considered to define
both an internal constraint (relating the augmented strings with 0 characters inserted) and
an external constraint (relating the strings as they stood before 0-insertion).

This distinction becomes important when there is more than one automaton in a KIMMO
system. The notion of "satisfying every constraint" could refer to intersecting either the
internal or the external versions of the constraints defined by the automata. If the external
languages are intersected, different automata can disagree about the placement of nulls. (This
corresponds to interpreting null characters as ordinary empty strings (epsilons, ε), since the
number of occurrences of the empty string between any two characters is indeterminate.) On
the other hand, if the internal forms of the constraints are intersected, all the automata must
agree on the number of nulls and their positions.

The actual KiMMO system performs internal intersection of the constraints defined by the
automata. Ron Kaplan* has pointed out that this subtle distinction in the interpretation of
KiMMO nulls has computational consequences. If the various constraints of a KiMMO system
were subject to external rather than internal intersection, thus interpreting KiMMO nulls as
ordinary epsilons, then BIGMACHINE precompilation would not be generally possible.

Since BIGMACHINE precompilation produces a single large finite-state transducer as output,
the intersection operation that it implicitly implements must always map finite-state
constraints into finite-state constraints. External intersection does not have this property, and
therefore BIGMACHINE precompilation would not be generally possible if external intersection
were used. Specifically, Kaplan has called attention to the following finite-state relations over
lexical-surface pairs:

A = (a/b)*(0/c)*

and B = (0/b)*(a/c)*

Each of these relations is easy to encode in a KiMMO automaton, but their external intersection

A ∩ B = {a^n / b^n c^n}

cannot be defined by any KiMMO automaton, large or small, despite its finite-state origins.

* These remarks were made in a talk to the Workshop on Finite-State Morphology, Center for
the Study of Language and Information, Stanford University, July 29-30, 1985.
This example makes crucial use of the fact that external intersection allows different
automata to disagree about the placement of nulls; under internal intersection (e.g. in the
current KiMMO system) no nontrivial lexical-surface pair satisfies both of the constraints. For
instance, A will reject the external string pair aa/bbcc except as aa00/bbcc, while B will
reject it except as 00aa/bbcc. Since internal intersection requires all automata to agree about
the placement of nulls, aa/bbcc will be rejected under internal intersection.
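The contrast between the two kinds of intersection can be checked mechanically on small cases. The following is only a minimal sketch, not part of the KiMMO system: it encodes each lexical/surface pair as a two-character token, uses Python regular expressions in place of KiMMO automata, and assumes that nulls are inserted only on the lexical side (which is all Kaplan's example requires). The names `alignments`, `accepts_external`, `external_intersection`, and `internal_intersection` are hypothetical.

```python
import re
from itertools import combinations

# Each internal string is a sequence of lexical/surface pairs, written as
# two-character tokens: "ab" stands for a/b, "0c" stands for 0/c, and so on.
A = re.compile(r"(?:ab)*(?:0c)*")  # A = (a/b)*(0/c)*
B = re.compile(r"(?:0b)*(?:ac)*")  # B = (0/b)*(a/c)*


def alignments(lexical, surface):
    """Yield every internal pair string obtained by inserting lexical nulls.

    Assumption of this sketch: only the lexical string is padded with 0.
    """
    n = len(surface)
    for positions in combinations(range(n), len(lexical)):
        padded = ["0"] * n
        for ch, pos in zip(lexical, positions):
            padded[pos] = ch
        yield "".join(l + s for l, s in zip(padded, surface))


def accepts_external(automaton, lexical, surface):
    """External constraint: some placement of nulls satisfies the automaton."""
    return any(automaton.fullmatch(p) for p in alignments(lexical, surface))


def external_intersection(lexical, surface):
    """Each automaton may choose its own placement of nulls."""
    return (accepts_external(A, lexical, surface)
            and accepts_external(B, lexical, surface))


def internal_intersection(lexical, surface):
    """All automata must accept the same null-augmented string."""
    return any(A.fullmatch(p) and B.fullmatch(p)
               for p in alignments(lexical, surface))


if __name__ == "__main__":
    print(external_intersection("aa", "bbcc"))  # accepted externally
    print(internal_intersection("aa", "bbcc"))  # rejected internally
```

In this sketch the pair aa/bbcc passes the external test because A accepts the alignment aa00/bbcc and B accepts 00aa/bbcc, but no single alignment satisfies both, so the internal test fails.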

The computational consequences of the distinction between internal and external
intersection become more severe when KiMMO systems are generalized slightly. For example, if
KIMMO automata are generalized to use three levels instead of two, and if certain other small
changes are made, then the recognition problem becomes computationally undecidable under
external intersection (Barton, 1985b).


7.  Improving  KiMMO  Dictionary  Efficiency 

One final matter remains. Despite the fact that navigation through the lexicons of the
dictionary component can account for quite a bit of backtracking in the current KiMMO system,
the previous sections gave little attention to that problem. Instead, section 3.3.2 promised that
the dictionary component could be changed in such a way that most of the choice points would
be eliminated. This section explains how.


7.1.  Subdivisions  of  the  Dictionary 

Naturally, there would be no need to choose among alternative lexicons if the dictionary
were not subdivided. In the existing KiMMO system, subdivisions are needed for two reasons.
First, the continuation-class mechanism is the only means for expressing co-occurrence
restrictions among roots and affixes, and a continuation class is a set of lexicons. Second, incorrect
dictionary search paths can be recognized and pruned more quickly when suffixes are stored
separately from roots.

The existing continuation-class mechanism makes the lexicon the finest unit of
discrimination between suffixes. If a, x, y are dictionary entries such that the sequence ax is possible
but ay is not, this constraint will be impossible to capture unless x and y are listed in separate
lexicons; if they are in the same lexicon, it will be impossible for the continuation class of
a to include x but not y. Thus the need to express co-occurrence restrictions leads to the
use of multiple lexicons. For example, Karttunen and Wittenburg (1983:22?) must list -ed
and -er in separate lexicons because of such contrasts as doer/*doed. In the special case
of separated dependencies, the weakness of the current continuation-class mechanism leads
to a large amount of duplicated structure in the multiple lexicons that must be constructed
(Karttunen, 1983:180).
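The point that the lexicon is the finest unit of discrimination can be illustrated with a toy dictionary. This is a sketch under stated assumptions, not the KiMMO data format: each lexicon is modelled as a Python dict from entry to continuation class, a continuation class is a set of lexicon names, and the lexicon names (`ROOT`, `SUFFIX`, `PAST`, `AGENT`), the entries, and the function `sequence_allowed` are all hypothetical.

```python
def sequence_allowed(lexicons, first_lexicon, first, second):
    """Return True if `second` is listed in some lexicon that belongs to
    the continuation class of `first`."""
    continuation_class = lexicons[first_lexicon][first]
    return any(second in lexicons[name] for name in continuation_class)


# -ed and -er share one lexicon: a root that admits -er must also admit -ed.
shared = {
    "ROOT": {"do": {"SUFFIX"}, "walk": {"SUFFIX"}},
    "SUFFIX": {"ed": set(), "er": set()},
}

# -ed and -er in separate lexicons: each root can pick the ones it allows.
split = {
    "ROOT": {"do": {"AGENT"}, "walk": {"PAST", "AGENT"}},
    "PAST": {"ed": set()},
    "AGENT": {"er": set()},
}

if __name__ == "__main__":
    print(sequence_allowed(shared, "ROOT", "do", "ed"))  # overgenerates *doed
    print(sequence_allowed(split, "ROOT", "do", "ed"))   # *doed excluded
    print(sequence_allowed(split, "ROOT", "do", "er"))   # doer allowed
```

With the shared suffix lexicon there is no continuation class for do that admits -er without also admitting -ed; only the split version can express the contrast, at the price of more lexicons to search.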

Small lexicons are also advantageous for pruning search, since it can become apparent
very early that no acceptable suffix starts out with the letters at hand. For instance, if none of
the suffixes that can attach to the current word start with a, it is pointless to search beyond
an a in the input (ignoring spelling-change rules here). If the legal suffixes for the current
class of word are stored in a separate lexicon, the letter-tree version of the lexicon will not
be searched beyond an a. However, if they are listed with many other suffixes such as -able,
the search will not be aborted until later, possibly not until the end of a suffix, when the
combinatory features of the suffix can be checked.

Unfortunately, multiple lexicons slow analysis down quite a bit in the current version
of KIMMO. Each of the lexicons in a continuation class is searched separately. The first few
characters beyond a lexicon choice point tend to get reanalyzed several times, with that portion
of the lexical-surface correspondence worked out afresh each time. If x, y above are stems (N,
V, etc.) instead of suffixes (that is, if a is a prefix), then the root lexicon becomes subdivided.
In such a situation, the separate searching of the different portions of the root lexicon becomes
especially serious. Much storage is also wasted (Karttunen and Wittenburg, 1983:221f).

In some cases, however, the current finite-state lexicon structure cannot capture the proper
co-occurrence restrictions even if duplication and inefficiency can be tolerated. Prefixes
generally apply only to words of particular classes, thus making it necessary to have separate
lexicons for the various classes of words involved. But since prefixes and suffixes can
productively form new words of various classes (for instance, -ize forms verbs), it may not be
possible for a lexicon to list them all. Formally speaking, if both prefixes and suffixes (i) are
fully productive, (ii) can change the categories of words arbitrarily, and (iii) can attach to only
particular categories of words, then separated dependencies can arise that exceed the power of
a finite-state lexicon structure. In such cases, context-free rules of some kind might be better
suited to the hierarchical word-structures that are involved. Alternatively, it might be
preferable to subdivide the problem by enforcing only crude finite-state combinatorial constraints
while figuring out the lexical-surface correspondence, then filtering the analyses in a more
sophisticated way afterward.


7.2.  Merging  the  Lexicons 

The number of separate lexicon searches can obviously be reduced if there is only one
lexicon. Roots and affixes can all be listed together, with the combinatory possibilities of
various elements indicated by a feature system. Such a feature system can be used whether or
not the existing finite-state dictionary framework is replaced with something more powerful.

Within the existing framework, each lexicon name can be interpreted as a feature; the
continuation class of each entry is then taken to specify the possible lexicon features of its
immediate successor in the word. Alternatively, a more powerful framework might be modelled
after the linguistic framework of Lieber (1980). Context-free machinery of some kind could
implement the recovery of hierarchical structure, the application of Lieber's feature-percolation
conventions, and the enforcement of combinatory restrictions. Common grammar-processing
techniques could be used to predict at each boundary the set of permissible combinatorial
features (the continuation class) of the next segment of input.

As noted, however, merging the lexicons in this way has the disadvantage that it prolongs
some dictionary searches that would have failed early with more finely-divided lexicons. At
modest cost in time and space, this disadvantage can be eliminated by adding bit vectors to
the internal letter-tree form of the lexicon. The bit vector associated with a link in the letter
tree indicates which classes of words or affixes can be found in the subtree below. Bit vectors
should also be associated with the outputs of the tree.

The bit-vector scheme makes it possible to search in parallel through all of the lexicons in
a continuation class. The implementation will no longer interpret a continuation class in terms
of the individual letter-trees of several lexicons; instead, a continuation class will correspond
to an encoded set of lexicon names for use in descending the single merged letter-tree. Before
descending a branch (or using an output), it is necessary to check whether there is a non-null
intersection between the lexicons comprising the desired continuation class and the lexicons
accessible down the branch. On many computers, this test can be carried out in a single
instruction, if the number of lexicons in the dictionary is small (e.g. < 32). Search should
terminate if the intersection is null. With the “virtual” split lexicons provided by the bit-vector
scheme, a failing search can terminate just as early in the lexical string as it will with lexicons
that have individual letter-trees; Figure 16 shows an idealized illustration. In an actual system,
the dictionary would have more finely divided lexicons than N and V, especially for suffixes.
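The merged letter tree with bit vectors can be sketched in a few lines. This is an illustrative sketch only, not the implementation used to produce the traces in this paper: the class `Node`, the functions `insert` and `lookup`, the table `LEXICON_BITS`, and the toy entries are all assumptions, and spelling-change automata are ignored entirely. Each link carries the set of lexicons reachable through it, encoded as an integer so that the intersection test is a single bitwise AND.

```python
LEXICON_BITS = {"N": 0b01, "V": 0b10}  # one bit per lexicon (hypothetical)


class Node:
    def __init__(self):
        self.children = {}    # letter -> child Node
        self.link_mask = 0    # lexicons reachable through the link into this node
        self.output_mask = 0  # lexicons that have an entry ending at this node


def insert(root, word, lexicon):
    """Add `word` to the merged tree, marking every link on its path."""
    bit = LEXICON_BITS[lexicon]
    node = root
    for letter in word:
        node = node.children.setdefault(letter, Node())
        node.link_mask |= bit
    node.output_mask |= bit


def lookup(root, text, wanted):
    """Descend the merged tree along `text`, restricted to the lexicons in
    the bit vector `wanted`. Returns (entries found, links traversed)."""
    found, steps, node = [], 0, root
    for i, letter in enumerate(text):
        child = node.children.get(letter)
        # Stop as soon as no wanted lexicon is accessible down the branch.
        if child is None or not (child.link_mask & wanted):
            break
        node = child
        steps += 1
        if node.output_mask & wanted:
            found.append(text[: i + 1])
    return found, steps


root = Node()
for noun in ("kid", "kit"):
    insert(root, noun, "N")
for verb in ("kill", "kiss"):
    insert(root, verb, "V")

if __name__ == "__main__":
    print(lookup(root, "kills", LEXICON_BITS["N"]))  # pruned at the kil... link
    print(lookup(root, "kills", LEXICON_BITS["V"]))  # finds the verb entry
```

When only a noun is acceptable, the descent for kills stops after two links, because the link for the third letter carries only the V bit; a merged tree without bit vectors would have to reach the end of the entry kill before its category could be checked.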

An implementation of this dictionary scheme was used to generate the traces shown in
Figure 3 and succeeding figures. Without the merged dictionary, the recognizer for English
locates a suffix in the continuation class /V by doing a separate letter-tree descent for each of
the lexicons P3, PS, PP, PR, I, AG, and AB. With the merged dictionary, the recognizer needs
only one letter-tree descent in the virtual lexicon (/V) = {P3, PS, PP, PR, I, AG}, thus reducing
the number of steps needed to analyze an input. Finely divided lexicons (hence continuation
classes with several members) are typically necessary for capturing co-occurrence restrictions
even in approximate form, and consequently the merged dictionary almost always speeds up
recognizer operation. Finally, even though it takes extra space to augment links and outputs
with bit-vectors, the merged dictionary can also save space by sharing structure among what
would otherwise be separate letter trees.

Figure 16: If separate letter trees for nouns and verbs are merged as on the left, failing searches may
be prolonged unnecessarily. Assuming that no nouns are accessible down the kil... branch of the
merged tree, it is useless to traverse that branch if only a noun is acceptable in the current context.
However, the fruitlessness of the branch may not be apparent until the end of an entry (e.g. kill)
is reached and category features are available. In the letter tree on the right, each link has been
augmented with a bit-vector that indicates the classes of entries that are accessible down the link.
The bit-vectors enable the system to terminate a failing search without going any further down the
tree than it would with unmerged lexicons. In this case, the kil... subtree would not be searched
because the intersection of {V} and {N} is null.


8.  References 


Barton, E. (1985a). “On the Complexity of ID/LP Parsing,” A.I. Memo No. 812, M.I.T.
Artificial Intelligence Laboratory, Cambridge, Mass.

Barton, E. (1985b). “Intractability in Finite State Machinery,” A.I. Memo No. 878, M.I.T.
Artificial Intelligence Laboratory, Cambridge, Mass. (Forthcoming; tentative title.)

Berwick, R., and A. Weinberg (1982). “Parsing Efficiency, Computational Complexity, and
the Evaluation of Grammatical Theories,” Linguistic Inquiry 13.2:165-191.

Clements, G., and E. Sezer (1982). “Vowel and Consonant Disharmony in Turkish,” in
van der Hulst and Smith (1982b:213-256).

Gajek, O., H. Beck, D. Elder, and G. Whittemore (1983). “LISP Implementation [of the
KIMMO system],” Texas Linguistic Forum 22:187-202.

Garey, M., and D. Johnson (1979). Computers and Intractability. San Francisco: W. H.
Freeman and Co.

Hale, K. (1982). “Some Essential Features of Warlpiri Verbal Clauses,” in Swartz (1982:217-
315).

Karttunen, L. (1983). “KiMMO: A Two-Level Morphological Analyzer,” Texas Linguistic
Forum 22:165-186.

Karttunen, L., and K. Wittenburg (1983). “A Two-Level Morphological Analysis of English,”
Texas Linguistic Forum 22:217-228.

Kay, M., and R. Kaplan (1982). “Word Recognition,” unpublished draft ms. dated May 1982,
Xerox Palo Alto Research Center, Palo Alto, California.

Lieber, R. (1980). On the Organization of the Lexicon. Ph.D. thesis, Department of Linguistics
and Philosophy, M.I.T., Cambridge, Mass.

Lindstedt, J. (1984). “A Two-Level Description of Old Church Slavonic Morphology,” Scando-
Slavica 30:165-189.

McCarthy, J. J. (1982). “Prosodic Templates, Morphemic Templates, and Morphemic Tiers,”
in van der Hulst and Smith (1982a:191-223).

Nash, D. (1980). Topics in Warlpiri Grammar. Ph.D. thesis, Department of Linguistics and
Philosophy, M.I.T., Cambridge, Mass.

Poser, W. (1982). “Phonological Representation and Action-At-A-Distance,” in van der Hulst
and Smith (1982b:121-158).

Swartz, S., ed. (1982). Papers in Warlpiri Grammar in Memory of Lothar Jagst. Work-Papers
of SIL-AAB, Series A, Volume 6. Summer Institute of Linguistics, Berrimah, N.T.

Underhill, R. (1976). Turkish Grammar. Cambridge, Mass.: M.I.T. Press.

van der Hulst, H., and N. Smith, eds. (1982a). The Structure of Phonological Representations,
Part I. Dordrecht, Holland: Foris Publications.

van der Hulst, H., and N. Smith, eds. (1982b). The Structure of Phonological Representations,
Part II. Dordrecht, Holland: Foris Publications.


